ReDiag: An Interactive Research Tool to Address Common Misconceptions in Linear Regression Model Diagnostics

Perseverence Savieri; Kurt Barbé; Lara Stas

doi:10.5334/jors.553


(1) Overview

Introduction

In linear regression, the accuracy and reliability of results depend on four main assumptions [1]. First, the relationship between the continuous predictor variable(s) and the mean of the outcome variable should be linear. Second, observations should be independent of each other. Third, the variance of the residuals should be constant across all the gathered observations. Fourth, the residuals should follow a normal distribution [2, 3].

Misconceptions about these assumptions are widespread, as highlighted by Ernst and Albers (2017) in their systematic review [4]. They found that many researchers incorrectly assume that normality of the variables, rather than of the residuals, is a necessary condition. Additionally, most studies lacked transparency in reporting assumption checks, reflecting an incomplete understanding of these concepts. Despite the central role of linear regression in health and biomedical research, recent evidence shows that only a minority of studies adequately assess its key assumptions; many misapply diagnostics or fail to report them altogether, raising concerns about the validity of their conclusions and the potential implications for patient care [5]. These issues underscore the need for accessible tools that offer practical guidance and encourage better diagnostic habits. Hence, in this paper, we focus on the commonly misunderstood assumptions of linear regression. The independence assumption is not highlighted in this study because violations of independence are typically anticipated from the experimental design. In contrast, the remaining three assumptions are frequently misunderstood, and their violations have direct consequences: linearity violations lead to biased models, departures from normality result in incorrect hypothesis testing, and unequal variances among subjects generally result in a loss of power (from an applied statistics perspective). It is important to note that heteroscedasticity, although it does not directly bias the model, can still affect the accuracy of the standard errors, which impacts the validity of inference.

As mentioned above, researchers often wrongly assess the normality of the variables instead of evaluating the residuals of the regression model [6, 7, 8], which are the differences between the observed and fitted (predicted) outcomes [9, 10]. When the distribution of the residuals deviates from normality, statistical tests and confidence intervals for the regression coefficients are biased [9], hindering evidence-based inference. The misconception that the normality assumption applies to the outcome and predictor variables is problematic because, in practice, these variables very likely do not follow a normal distribution. Even so, the residuals can still be normally distributed. For example, binary predictor variables do not follow a normal distribution. Nonetheless, regression models that use binary predictors can still have normally distributed residuals, which allows for reliable conclusions [11, 12, 13]. Thus, the normality and homoscedasticity assumptions combined specify that the residuals of the model should follow a normal distribution with a mean of zero and a constant (fixed) standard deviation.

Questions also remain about why the linearity assumption is essential and how to check and correct for its violation, as researchers sometimes falsely assume that linear regression can only model straight-line relationships [14, 15]. However, the linearity assumption concerns the model parameters: the fitted values must be a linear function of the parameters. In their article ‘Assumptions of Multiple Regression: Correcting Two Misconceptions,’ Williams et al. explain this misconception in detail. They illustrate how linearity is not necessarily a property of the variables themselves but instead of the parameters and estimates in the regression model [13]. They argue that determining how to verify this assumption and correct its violation is essential. When the model violates the linearity assumption, it fails to describe the relationship between the mean outcome and the predictor variables, leading to invalid estimates [9].
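
To make the point about linearity in the parameters concrete, the following sketch fits a curved relationship with ordinary least squares. It is written in Python with the standard library only, is not ReDiag code, and uses hypothetical noise-free data; `solve` and `least_squares` are illustrative helper names that implement the normal equations directly.

```python
def solve(a, b):
    """Solve the linear system a @ x = b by Gauss-Jordan elimination
    with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def least_squares(design, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y."""
    k = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)]
           for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)


# Hypothetical noise-free data from a curved relationship: y = 1 + 2x + 3x^2.
x = [-2, -1, 0, 1, 2, 3]
y = [1 + 2 * xi + 3 * xi ** 2 for xi in x]

# The model is linear in b0, b1 and b2 even though it is curved in x:
# x^2 simply enters the design matrix as one more column.
design = [[1, xi, xi ** 2] for xi in x]
b0, b1, b2 = least_squares(design, y)
print(b0, b1, b2)
```

Because the quadratic term is just another column of the design matrix, the same least squares machinery recovers the three coefficients (up to floating-point error) without any change to the estimation method.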

Existing methodology

Understanding these assumptions presents different educational problems, especially when explaining their importance [16]. However, recent technological advancements have revolutionised how statistics is applied and taught [17, 18, 19, 20, 21, 22]. These advancements, as advocated by Chance et al. [23] and the Guidelines for Assessment and Instruction in Statistics Education (GAISE) [24], play a crucial role in deepening researchers’ understanding of statistical concepts [25] and serve as an effective medium for communicating complex ideas [26, 27]. Therefore, researchers have developed user-friendly applications to create an effective learning environment that helps non-statisticians understand statistical concepts through interactive outputs [28, 29]. However, the existing applications for regression analysis [30, 31, 32] do not allow researchers to use their own data and provide no guidance on addressing violated regression assumptions. Nicholson et al. [33] demonstrated that integrating statistical modelling techniques into a user-friendly application can support these features, encouraging researchers to understand unfamiliar statistical methods. Consequently, there is a critical gap in the scientific literature for a supporting research tool that provides interactive plot outputs and allows users to fit regression models to their own data to guide the study of these assumptions.

This study makes a notable contribution to the literature by addressing the recommendations of Hughes et al. [34], who emphasised the significance of user-friendly applications supported by thorough documentation and features catering to users with diverse backgrounds and skill sets. We present ReDiag, an original Shiny app developed to assess linearity, normality and homoscedasticity assumptions in linear regression while addressing common misconceptions surrounding these assumptions. Researchers can upload their data or use example datasets to identify assumption violations and explore adjustments, making statistical diagnostics accessible even to those with no prior knowledge of R. Furthermore, detailed explanations and guidance are provided within the application.

Implementation and architecture

ReDiag leverages the Shiny package [35], a robust web application framework from RStudio [36], to create an intuitive and user-friendly experience. Although R [37] runs in the background and powers the analytical engine of the application, no knowledge of R is required. ReDiag bridges the gap between simplicity and statistical rigour by combining the user-friendliness of the Shiny interface with the power of R. Additionally, the app is reactive, meaning it responds instantly to the user’s input and updates plots and results in real time. A complete list of the R packages that were used to implement the app (e.g. MASS [38], car [39], ggplot2 [40], dplyr [41]) is available in the source code and on GitHub as supplementary material.

The user interface has a navigation bar with four tabs: Home, Model, Manual, and Contact Us (see Figure 1). The Home tab provides general and background information about the application. The Model tab contains the tools needed to fit the regression model. The Manual tab offers detailed instructions on how to use the application. The Contact Us tab contains the contact details of the developers for support or feedback.

**Figure 1: ReDiag graphical user interface**. Screenshot of the Home tab.

Illustrated Examples

A tutorial using two example datasets from published studies [42, 43] of behavioural tests and cage occupancy in mice is available on GitHub as supplementary material. To get a glimpse of the architecture of ReDiag, consider screenshots of the launched app in the following steps.

The Model tab consists of an input (sidebar) panel for the model’s inputs and an output (main) panel, where the outputs are displayed. The sidebar panel consists of four tabs, and the main panel comprises seven output tabs, which will be discussed in the following section (Figure 2).

**Figure 2: Model tab**. Screenshot of the data input sidebar and a preview of the data.

The first step in fitting the regression model is to load a dataset in the Data Input tab in the sidebar panel (see Figure 2). Users can load one of the example datasets by selecting it under the header “Choose example data.” Alternatively, users can import their own dataset, in which case they must specify the file extension. The main panel displays the loaded dataset under the View Data tab, and the Data Summary tab presents descriptive statistics. The data type of a variable can be changed by selecting the variable and the new data type and then clicking the “Change data type” button. Additionally, under the Transform tab, users can centre variables by subtracting the mean, which improves the interpretability of the intercept term. To further enhance interpretability, especially for predictors measured in very large or very small units, rescaling variables to meaningful increments (e.g., per 10 or 50 units) is recommended, as this makes the regression coefficients easier to interpret.

After selecting the desired dataset, users must specify the regression model’s outcome and predictor variables in the sidebar panel’s Define Model tab (Figure 3). If the model contains interaction terms, they can be created and added by marking the checkboxes under the “Create and add terms(s)” button. Once the model is defined, users must click the “Run Analysis” button to fit the model. The Model Summary output tab shows the regression model summary.

**Figure 3: Model Summary tab**. Screenshot of the defined model and regression output.

The Linearity Assumption, Normality Assumption and Homoscedasticity Assumption tabs provide instructions on assessing the model diagnostics. The app includes educational text on how to check the model assumptions and offers researchers a comprehensive background in terminology. This information can be accessed by ticking the ‘Show background information’ checkbox. When fitting a multiple linear regression model, checking the linearity assumption of the regression model might include creating component-plus-residual (partial residual) plots [9, 10]. The Linearity Assumption: C+R Plots tab allows for easy creation and assessment of these plots.

Under “Acceptable Appearance of Plots,” these tabs display static residual plots depicting ideal scenarios when the assumptions are met, along with specific interpretations (Figure 4). Reactive plots, automatically adapted from the fitted regression model, will appear under the “Residual plots from your model” heading. Users can then examine the reactive residual plots from their models to verify the regression model assumptions by using the static plots’ interpretations.

**Figure 4: Normality Assumption tab**. Screenshot of the theoretical text describing the assumption, acceptable appearance of residuals and residuals from the fitted model.

If the residual plots show violated assumptions, ReDiag offers general transformation options in the Transform sidebar tab, including square, square root, cube, and exponential transformations. In addition, there is an option to apply Box-Cox power transformations, which can be accessed through a checkbox. A slider can be used to choose the best power, and the reactive QQ plot will depict the updated residuals (Figure 5). The Manual tab provides further information on the Box-Cox transformation technique. After applying transformations, users have the option to download the updated data. It is important to note that transformations can significantly affect the interpretation of regression coefficients and predictions; hence, users should carefully consider and interpret results in the context of the transformed scale or, in some cases, back-transform to the original scale for easier interpretation.

**Figure 5: Box-Cox transformations**. A screenshot of the interactive plot where the slider was utilised to search for the correct power transformation.

Users can download the summary of the analyses as a PDF, Word, or HTML file via the Download Report tab on the main panel. ReDiag also provides the R code used in the Shiny application to produce the graphical outputs. This code, in turn, allows the analyses to be shared and reproduced.

Quality control

The app was evaluated through usability testing on various web browsers (i.e., Google Chrome, Safari, Firefox) with students from the Master’s in Biomedical Sciences and Research Master’s in Gerontological Sciences programs at the Vrije Universiteit Brussel (VUB), as well as colleagues from the Biostatistics and Medical Informatics (BISI) research group. These feedback sessions helped identify and address several minor errors as well as user interface and server issues, such as reactivity concerns and methods not functioning as expected. One of the most significant outcomes was the introduction of optional checkboxes for background information, allowing users to tailor the app to their level of expertise. Additionally, error messages were implemented to prompt users when essential inputs were missing before running the models. Overall, the usability testing proved invaluable, significantly enhancing the app’s functionality and user experience.

(2) Availability

The web application is available online at https://zq9mvv-vub0square.shinyapps.io/ReDiag-research-tool/, and the source code is available at https://github.com/vub-square/ReDiag-Shiny-app. Alternatively, users can download the source code directly from GitHub and run the application locally in R using the command shiny::runApp() after setting the working directory to the downloaded source code location.

Operating system

Windows, macOS, Linux

Programming language

R and HTML/CSS for UI

Additional system requirements

None.

Dependencies

None.

List of contributors

Perseverence Savieri, project lead, researcher, and developer

Kurt Barbé, project mentor

Lara Stas, project mentor

Software location

Code repository

Name: GitHub
Identifier: https://github.com/vub-square/ReDiag-Shiny-app
Licence: MIT
Date published: 20/12/24

Language

English

(3) Reuse potential

ReDiag is a versatile and freely accessible online tool designed to assess key linear regression assumptions: normality, homoscedasticity, and linearity, making it invaluable for researchers across various fields. Its primary audience includes biomedical scientists, non-statisticians, and academics who need to evaluate regression assumptions but may lack advanced statistical expertise. ReDiag’s intuitive interface and interactive visualisations make it suitable for diverse use cases, including teaching the importance of regression assumptions, ensuring quality assurance in research projects, and evaluating datasets in fields such as social sciences, economics, and environmental studies. By allowing users to upload their data, customise regression models, and download plot outputs and reports, ReDiag promotes reproducibility and transparency in research.

Built using R’s Shiny framework, ReDiag combines user-friendly functionality with the analytical power of R. Advanced users can modify the app by accessing its reactive plotting code to tailor diagnostics, while beginners can learn about regression diagnostics through guided outputs. Contributors can extend the tool’s capabilities by accessing the source code via GitHub and implementing additional diagnostic tests, supporting other regression models, or enhancing visualisation techniques. As part of future work (long-term utility and maintenance), we plan to provide periodic updates to adapt to changes in R packages and enhance the app’s functionality. This includes potential extensions to other model families such as Poisson and Binomial regression. Researchers interested in collaborating are encouraged to reach out via the app’s Contact Us tab. By filling a critical gap in regression diagnostic tools, ReDiag empowers researchers to ensure robust, valid statistical analyses, advancing research quality across disciplines.

Acknowledgements

We thank Gamze Ates and Ann Massie, our colleagues from the Laboratory of Neuro-Ageing & Viro-Immunotherapy, Centre for Neurosciences (C4N), Vrije Universiteit Brussel (VUB), for providing an additional example dataset.

Competing Interests

The authors have no competing interests to declare.