flowchart: An R Package for Data Flowchart Generation

Pau Satorra; João Carmezim; Natàlia Pallarès; Kenneth A. Taylor; Cristian Tebé

doi:10.5334/jors.649

(1) Overview

Introduction

Clinical research reporting guidelines, such as the Consolidated Standards of Reporting Trials (CONSORT) [1] and Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) [2], address the need to clearly report (where applicable) the number of subjects at each stage, from the initial cohort of eligible or screened subjects through to the final number of subjects analyzed. In particular, the report should include the number of subjects who were recruited, assessed for eligibility, confirmed as eligible, and subsequently included/enrolled/randomized to treatment, as well as the number of subjects who received the intended treatment (or exposure of interest) and those who completed follow-up and were analyzed for the primary outcome. Reasons should also be provided for non-participation or exclusion at each of the relevant study stages, and the flow should be documented separately for each group when relevant (e.g., for the treated and control groups after randomization; for exposed and unexposed groups in a cohort study). The steps and numbers must be clearly defined and the process must be transparent in order to ensure the reproducibility of the study and high-quality reporting. The guidance from these reporting guidelines aligns with the International Council for Harmonisation of Technical Requirements for Pharmaceuticals for Human Use (ICH) guidelines, specifically E3 Structure and Content of Clinical Study Reports [3], E6 Good Clinical Practice [4], and E9 Statistical Principles for Clinical Trials [5].

The optimal method for illustrating participant flow through study stages, as recommended in these guidelines, is a flowchart (or flow diagram) composed of boxes and arrows. The number of subjects screened for eligibility and potential inclusion in the study, if known, is placed at the top of the flowchart. At each stage, the number of subjects excluded and the reasons for exclusion are provided. If applicable, the number of subjects in each study group and the number who received the intended treatment (or exposure) are also reported. Finally, the subjects included in the main analysis are reported, along with the number of exclusions due to loss to follow-up or other reasons.

The accurate creation of these flowcharts is a joint task between the data management team and statisticians. It is time-consuming and labor-intensive, as every screened or recruited subject must be accounted for, without exception. In a randomized trial setting, this process must be repeated many times until the database is closed for analysis. Several R packages are designed to draw flowcharts and can facilitate this task. However, most of these packages do not allow flexible designs or are difficult to program and reproduce, often requiring the numbers to be entered manually or parameterized. The Gmisc package [6], which our package uses internally for drawing, allows users to draw very complex flowcharts, but the programming can be challenging, and the numbers have to be entered manually or parameterized. The same applies to the DiagrammeR package [7], which uses a set of functions to draw flow diagrams based on the addition of nodes and edges. The CONSORT package [8] is mainly based on DiagrammeR and focuses on generating standard CONSORT diagrams. The resulting diagrams are adequate, but they require considerable preprocessing of the data passed to the function, and the package is limited to diagrams designed specifically for randomized controlled clinical trials. Finally, the ggflowchart package [9] allows users to plot flowcharts using the capabilities of the ggplot2 package [10], but it is primarily focused on generating decision trees rather than subject flow diagrams. Therefore, we have developed an R package called flowchart [11] to draw flowcharts in an easy and reproducible way within the R workflow.

Implementation and Architecture

The flowchart package is implemented entirely in R and consists of eight main functions, each designed for a different purpose. These functions are intended to be used within a tidyverse workflow [12] by chaining them with the pipe operator, which can be either the magrittr package [13] pipe (%>%) or the base pipe (|>). Two structural functions must always be used in the initial and final steps, respectively: as_fc() and fc_draw(). The first initializes a dataset as an object of the fc S3 class created for this package, while the second draws the resulting flowchart. The body of the flowchart is built by combining two functions that perform the two basic operations illustrated in the next section: fc_filter() and fc_split(). The resulting flowchart can then be customized and enhanced using the fc_modify() function, or combined with other flowcharts horizontally or vertically using the fc_merge() and fc_stack() functions, respectively. Once the final flowchart is drawn, it can be exported to the desired image format (bitmap or vector) using the fc_export() function. These functions build on other R packages. The main packages used for data management are dplyr [14], purrr [15], tidyr [16], and stringr [17]. To draw the flowchart, the fc_draw() function relies on the Gmisc package [6].

The flowchart Package

Installation

The package can be installed from the CRAN repository by running:
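As a minimal sketch, the standard CRAN installation call is shown here. The explicit repos mirror is an assumption, added only so that the call also works in non-interactive sessions where no default mirror is configured.

```r
# Install the released version from CRAN.
# The mirror URL is one common choice, not a requirement of the package.
install.packages("flowchart", repos = "https://cloud.r-project.org")

# Load the package to confirm that the installation worked.
library(flowchart)
```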

How does it work?

To illustrate the main features of the package, we will use the built-in safo dataset, which is a randomly generated dataset from the SAFO clinical trial [18]. The data can be loaded with:
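A short sketch of loading and previewing the data follows. The column selection mirrors the columns named in Table 1, and the use of head() for the preview is our assumption rather than the authors' exact listing.

```r
library(flowchart)

# Make the built-in dataset available in the workspace.
data(safo)

# Preview the first three rows of the columns shown in Table 1.
key_cols <- c("id", "inclusion_crit", "exclusion_crit",
              "decline_part", "group", "itt", "pp")
head(safo[, key_cols], 3)
```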

The first rows and most important columns of the dataset are displayed in Table 1.

Table 1

First three rows and most important columns of the SAFO dataset.

id	inclusion_crit	exclusion_crit	decline_part	group	itt	pp
1	Yes	No
2	No	No	Yes
3	No	No	No	cloxacillin plus fosfomycin	Yes	Yes

After loading the package, the first step is to initialize the flowchart from this dataset using the as_fc() function:
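A minimal sketch of this initialization step is given below. The object name safo_fc and the wording of the label are illustrative choices.

```r
library(flowchart)

# Initialize the flowchart from the built-in safo dataset.
# The label sets the text of the first box; its wording is illustrative.
safo_fc <- safo |>
  as_fc(label = "Patients assessed for eligibility")
```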

This creates an S3 object of class fc, which is structured as a list containing the dataset and the information related to the generated flowchart:
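One way to inspect this structure, sketched with base R functions only (the choice of str() with max.level = 1 is ours), is:

```r
library(flowchart)

safo_fc <- safo |>
  as_fc(label = "Patients assessed for eligibility")

# The object carries the fc S3 class.
class(safo_fc)

# It is a list; the components described in the text are $data and $fc.
names(safo_fc)

# Compact overview of the top-level structure.
str(safo_fc, max.level = 1)
```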

The object created in this first step contains the entire SAFO dataset in $data, since no operation has been applied yet, and in $fc a data frame with the information of the flowchart generated so far, which is a single initial box indicating the total number of patients assessed for eligibility in the SAFO trial. The label argument sets the text displayed in the box. From an fc object, we can always draw the associated flowchart using the fc_draw() function:
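As a sketch, drawing the initialized object only requires piping it into fc_draw(). The label wording is again illustrative.

```r
library(flowchart)

safo_fc <- safo |>
  as_fc(label = "Patients assessed for eligibility")

# At this point $data still holds every row of the original dataset.
nrow(safo_fc$data) == nrow(safo)

# Draw the flowchart, which so far consists of the initial box only.
safo_fc |>
  fc_draw()
```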

The result of drawing the initialized flowchart is shown in Figure 1a. Now, we create the rest of the boxes in the flowchart, combining the desired functions with the already initialized fc object, until the entire flowchart is built.

Figure 1

Drawing of each step in creating the SAFO study flowchart. All counts and percentages shown in each box are calculated automatically from the underlying dataset. a) First initialized flowchart. b) Flowchart with the first filtered box. c) Flowchart with the first filtered box and split.

To create a second box showing the number of subjects excluded before randomization, we can add the function fc_filter(), with show_exc = TRUE. We can change the label of the created box with the label argument:
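A sketch of this filtering step follows. We assume here that randomized patients are those with a non-missing group value, which matches the rows shown in Table 1; the label wording is illustrative.

```r
library(flowchart)

safo_filtered <- safo |>
  as_fc(label = "Patients assessed for eligibility") |>
  # Keep randomized patients (assumed: non-missing group) and
  # add a side box with the excluded patients.
  fc_filter(!is.na(group), label = "Randomized", show_exc = TRUE)

safo_filtered |>
  fc_draw()
```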

The resulting object now contains in $data only the randomized patients, and in $fc the information from the first box and the new filtered box, together with the box showing the excluded patients. The result of drawing this flowchart is shown in Figure 1b.

Now, to split the flowchart into the study groups, we can add the fc_split() function:
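The split can be sketched as one more step in the same pipeline. The filter condition and labels are the same illustrative assumptions as before, and group is the treatment arm column shown in Table 1.

```r
library(flowchart)

safo_split <- safo |>
  as_fc(label = "Patients assessed for eligibility") |>
  fc_filter(!is.na(group), label = "Randomized", show_exc = TRUE) |>
  # Create one box per level of the treatment group variable.
  fc_split(group)

safo_split |>
  fc_draw()
```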

The resulting output will contain in $data the same data but grouped by the variable used in the split and in $fc the previous flowchart information along with these two additional boxes, one for each group. The result of drawing this flowchart is shown in Figure 1c.

Finally, the intention-to-treat and per-protocol population boxes in the flowchart can be generated by calling the fc_filter() function twice:
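A sketch of these two calls, applied to an intermediate object, is shown below. The object names and label wording are illustrative, and we assume that the itt and pp columns are coded "Yes"/"No" as in Table 1.

```r
library(flowchart)

# Flowchart up to the split into study groups.
safo_split <- safo |>
  as_fc(label = "Patients assessed for eligibility") |>
  fc_filter(!is.na(group), label = "Randomized", show_exc = TRUE) |>
  fc_split(group)

# Two consecutive filters add the ITT and PP boxes within each group.
safo_final <- safo_split |>
  fc_filter(itt == "Yes", label = "Included in ITT") |>
  fc_filter(pp == "Yes", label = "Included in PP")

safo_final |>
  fc_draw()
```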

This last operation results in the final flowchart shown in Figure 2.

Figure 2

Drawing of the final study flowchart. All counts and percentages shown in each box are calculated automatically from the underlying dataset.

In summary, we have created the flowchart by chaining one operation per step until the desired diagram was obtained. All the code used to obtain this flowchart is summarized below:
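Put together as a single pipeline, the steps described above can be sketched as follows. The labels and the filter conditions are illustrative assumptions based on the columns in Table 1, not necessarily the authors' exact listing.

```r
library(flowchart)

safo |>
  # Step 1: initialize the flowchart.
  as_fc(label = "Patients assessed for eligibility") |>
  # Step 2: randomized patients, with a box for exclusions.
  fc_filter(!is.na(group), label = "Randomized", show_exc = TRUE) |>
  # Step 3: split by treatment group.
  fc_split(group) |>
  # Step 4: intention-to-treat and per-protocol populations.
  fc_filter(itt == "Yes", label = "Included in ITT") |>
  fc_filter(pp == "Yes", label = "Included in PP") |>
  # Step 5: draw the result.
  fc_draw()
```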

The fc_export() function can then be used to export the drawn flowchart to the most popular image formats (png, jpeg, tiff, bmp, svg, and pdf), with control over the width, height, and resolution.
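A hedged sketch of an export call follows. The file name, the temporary output location, and the specific width, height, and res values are illustrative assumptions; the package documentation should be consulted for the full argument list and defaults.

```r
library(flowchart)

# Illustrative output location in the session's temporary directory.
out_file <- file.path(tempdir(), "safo_flowchart.png")

safo |>
  as_fc(label = "Patients assessed for eligibility") |>
  fc_filter(!is.na(group), label = "Randomized", show_exc = TRUE) |>
  # The flowchart has to be drawn before it is exported.
  fc_draw() |>
  # The format is taken from the file extension; sizes are illustrative.
  fc_export(out_file, width = 3000, height = 4000, res = 700)
```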

Additional arguments can be passed to these functions to enhance and personalize the flowchart at each step. To apply the same formatting arguments to all functions at once, use the fc_theme() function. In the examples presented in this manuscript, fc_theme(text_fs = 20, text_fs_exc = 20) is used to adjust the text size for improved readability. The fc_modify() function can also be used to modify the flowchart parameters contained in $fc, allowing full customization. Users also have the option of combining different flowcharts horizontally or vertically using the fc_merge() and fc_stack() functions, providing even more flexibility.
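The sketch below illustrates fc_theme() together with fc_merge() and fc_stack(). The two small flowcharts are hypothetical examples built only to have something to combine, and passing the flowcharts as a list is our understanding of the interface rather than something stated above.

```r
library(flowchart)

# First flowchart: screening and randomization, with larger text.
fc_left <- safo |>
  as_fc(label = "Patients assessed for eligibility") |>
  fc_filter(!is.na(group), label = "Randomized", show_exc = TRUE) |>
  fc_theme(text_fs = 20, text_fs_exc = 20)

# Second flowchart: a single initial box, with the same text size.
fc_right <- safo |>
  as_fc(label = "Patients assessed for eligibility") |>
  fc_theme(text_fs = 20, text_fs_exc = 20)

# Combine the two flowcharts side by side and draw the result.
fc_merged <- list(fc_left, fc_right) |>
  fc_merge()
fc_merged |>
  fc_draw()

# Alternatively, place one flowchart above the other.
fc_stacked <- list(fc_left, fc_right) |>
  fc_stack()
fc_stacked |>
  fc_draw()
```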

If no dataset is available, or if the number of subjects in each step cannot be calculated from a variable in a dataset, the N argument can be used to manually specify the numbers to be displayed in the boxes at each step. This option is particularly useful when subject-level data are inaccessible and only aggregated data are available. Such situations commonly arise when the only information available is extracted from published articles or legacy reports, or when the raw source data are too large for R’s memory constraints (e.g., claims data starting with a cohort of several million). More information about these features and other examples, such as a more customized example that replicates the original flowchart reported in the SAFO paper, can be found on the package website [19].
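As a sketch of this aggregated-data workflow, the example below builds a flowchart from counts alone. All numbers and labels are hypothetical and are not taken from the SAFO trial.

```r
library(flowchart)

# Hypothetical aggregated counts; no subject-level dataset is used.
fc_counts <- as_fc(N = 300, label = "Patients assessed for eligibility") |>
  # 240 of the 300 subjects remain; the rest appear in the exclusion box.
  fc_filter(N = 240, label = "Randomized", show_exc = TRUE) |>
  # One count and one label per group; the counts sum to 240.
  fc_split(N = c(100, 80, 60), label = c("Group A", "Group B", "Group C"))

fc_counts |>
  fc_draw()
```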

Quality control

All functions in the flowchart package are systematically tested using the testthat package [20]. As of version 1.0.0, the test suite achieves 82% code coverage [21]. These tests ensure that the package behaves as expected across various scenarios, producing the correct outputs, warnings, or errors where appropriate. In addition, the test suite is incorporated into a continuous integration workflow via GitHub Actions, ensuring that all tests are automatically executed whenever the package is updated. Running the test() function from the devtools package [22] in the development environment shows that all of the implemented tests for each function are successful:
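To give a sense of what such a unit test looks like, here is a hypothetical testthat example. It is written for illustration only and is not taken from the package's actual test suite. Running devtools::test() from the package source directory executes all tests of this kind.

```r
library(testthat)
library(flowchart)

# Hypothetical test: as_fc() should return an object of class fc
# and keep every row of the input dataset in $data.
test_that("as_fc() initializes an fc object", {
  fc <- as_fc(safo, label = "Patients assessed for eligibility")
  expect_s3_class(fc, "fc")
  expect_equal(nrow(fc$data), nrow(safo))
})
```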

All test scripts are publicly available in the package’s GitHub repository.

(2) Availability

Operating system

The flowchart package is compatible with all operating systems that support R (>= 4.1.0).

Programming language

R 4.1.0 or higher.

Additional system requirements

None

Dependencies

The flowchart package requires the following R packages:

Gmisc,
grid,
tidyr,
dplyr (>= 1.1.0),
purrr,
stringr,
tibble,
tidyselect,
rlang,
grDevices,
cli

List of contributors

PS is the creator and maintainer of the flowchart package. The conceptual design of the package was done by PS, JC, NP, and CT. KT implemented new features and initiated the test suite for the package. The main text of the manuscript was written by PS and CT. All the authors of the paper commented on earlier versions of the manuscript and read and approved the final version.

Software location

Code repository

Name: GitHub
Identifier: https://github.com/bruigtp/flowchart
Licence: GPL (≥ 3)
Date published: 01/31/2024

Language

English

(3) Reuse potential

Although the flowchart package does not account for every possible scenario or study design, its flexibility, together with the ability to combine different flowcharts using the merge and stack functions, allows a wide range of study designs to be represented. While originally developed for clinical studies, its general framework makes it applicable to diverse research domains involving data analysis. The package is intended for use by biostatisticians, epidemiologists, clinical trial statisticians, data managers, and applied researchers working across a broad range of study settings.

Comprehensive documentation is provided through built-in vignettes and accompanying articles, which include detailed explanations, use cases, and reproducible examples. These resources enable users to quickly understand the functionality of the package and adapt it to their own research contexts.

Users can extend or modify the package through the GitHub repository, where the source code is fully accessible. All users are encouraged to submit suggestions, improvements, or bug reports via the issue tracker at https://github.com/bruigtp/flowchart/issues. Contributions can also be made through pull requests following standard open-source practices.

Acknowledgements

We thank the CERCA Programme/Generalitat de Catalunya for institutional support. We thank Dr S Grillo, Dr M Pujol, and Dr J Carratalà of the Hospital Universitari de Bellvitge for providing an anonymized version of the SAFO study database for use in the flowchart package.

Competing Interests

The authors have no competing interests to declare.