Keywords: long short-term memory; gated recurrent unit; recurrent neural network; autoregressive integrated moving average; support vector regression; random forest; Akaike information criterion; Bayesian information criterion; partial autocorrelation function; autocorrelation function; root mean square error; mean absolute error; mean absolute percentage error; coefficient of determination; machine learning; deep learning; coronavirus disease 2019
The coronavirus disease 2019 (COVID-19) pandemic, caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) beginning in late 2019, precipitated an unprecedented global health and economic crisis, with profound effects on diverse regions, including Africa [7,20,22,43]. In Kenya, the virus challenged a healthcare system already constrained by limited infrastructure, personnel, and financial resources, exacerbating vulnerabilities during multiple infection waves. The Kenyan government responded with measures such as lockdowns, testing campaigns, and vaccination drives, but their success hinged on accurate forecasting to inform timely resource allocation and policy interventions [17,30,31]. Accurate predictions of epidemic trends, including case numbers and mortality rates, are essential for optimizing healthcare preparedness, prioritizing interventions, and mitigating socioeconomic impacts [15,23,34,35,38]. Beyond Kenya, the pandemic underscored the global need for robust forecasting models to manage infectious diseases, particularly in low-resource settings where data and computational limitations pose significant challenges.
Epidemic forecasting leverages diverse modeling approaches to capture the complex dynamics of infectious diseases. Statistical models, such as the autoregressive integrated moving average (ARIMA), are valued for their simplicity and ability to model linear trends and seasonality in time series data [4,18,21]. However, their reliance on linear assumptions limits their effectiveness in capturing the nonlinear, volatile patterns typical of pandemics. Machine learning (ML) models, such as support vector regression (SVR) and random forest (RF), address these limitations by modeling complex, nonlinear relationships, offering improved predictive accuracy [12,14,25,29,32,33,36,44]. Deep learning (DL) models, including recurrent neural networks (RNN), long short-term memory (LSTM), and gated recurrent units (GRU), excel in capturing temporal dependencies in sequential data, making them well suited for epidemic time series [1,2,42]. These advanced models have informed global health strategies, particularly in resource-limited settings where efficient forecasting can optimize scarce resources [24,28].
A study by Arora et al. [2] employed DL models, specifically RNN-based LSTM variants (deep LSTM, convolutional LSTM, and bi-directional LSTM), to predict COVID-19-positive cases across 32 Indian states and union territories, selecting the LSTM variant with the lowest error for daily and weekly forecasts. Sinha et al. [40] forecasted COVID-19-confirmed cases in the USA, India, Brazil, Russia, and France, comparing artificial neural network (ANN) and RNN-based LSTM models, using mean squared error for validation. Their main finding was that LSTM outperformed ANN, indicating higher accuracy for epidemic predictions; these results suggest LSTM's potential for informing timely public health interventions in highly affected countries. Ghafouri-Fard et al. [10] used ML and DL methods, including the adaptive neuro-fuzzy inference system (ANFIS), LSTM, RNN, multilayer perceptron (MLP), and ARIMA, to predict COVID-19 case trends, comparing model performance using root mean square error (RMSE), mean absolute error (MAE), and mean absolute percentage error (MAPE), among other metrics.
Rustam et al. [37] applied ML models – linear regression (LR), least absolute shrinkage and selection operator (LASSO), support vector machines (SVM), and exponential smoothing (ES) – to forecast COVID-19 cases, deaths, and recoveries over 10 days. ES outperformed the other models, followed by LR and LASSO, while SVM showed poor performance across all prediction scenarios. The findings highlight the potential of ML, particularly ES, for effective COVID-19 forecasting to support decision-making.
This study contributes to epidemic forecasting by evaluating a range of forecasting models to predict COVID-19 trends in Kenya, focusing on total cases, critical cases, severe cases, and total deaths, using data from April 2020 to August 2021, to enhance epidemic preparedness in resource-constrained environments. The study conducts a comparative analysis of six models – ARIMA, SVR, RF, RNN, LSTM, and GRU – to forecast COVID-19 trends in Kenya, evaluating their performance across multiple dimensions to identify the most effective approach for public health applications.
The methodology employs an evaluation framework to compare model performance. Data from April 2020 to August 2021 are split into an 80:20 train-test ratio to ensure robust validation. Four evaluation metrics – RMSE, MAE, MAPE, and the coefficient of determination ($R^2$) – are used to assess predictive accuracy. The key contributions of this study are as follows:
- Comparative evaluation of six models – ARIMA, SVR, RF, RNN, LSTM, and GRU – for COVID-19 forecasting in Kenya, using RMSE, MAE, MAPE, and $R^2$ to identify RF as the most effective model for resource-constrained settings.
- Application of the Diebold–Mariano (DM) test with MSE, MAE, and MAPE as loss functions to confirm statistically significant differences in forecasting performance, highlighting RF's superior accuracy.
- Demonstration of RF's effectiveness in predicting total cases, critical cases, severe cases, and total deaths, offering practical insights for public health forecasting.
This article is organized as follows: Section 2 details the dataset and methodology, including model specifications, evaluation metrics, and the DM test. Section 3 presents the results, including descriptive statistics, forecasting outcomes, and robustness checks, supported by visualizations such as time series plots and heatmaps. Section 4 discusses the findings, emphasizing model strengths and limitations. Section 5 concludes with key insights and implications for epidemic preparedness in Kenya and similar contexts. Section 6 addresses study limitations and suggests future research directions, such as incorporating external features and cross-validation.
The dataset, obtained from the Ministry of Health website in Kenya, spans from April 15, 2020, to August 26, 2021, and includes four columns: total cases, severe cases, critical cases, and total deaths [27]. It provides daily records of these metrics, capturing the progression and impact of the COVID-19 pandemic. The analysis aims to model and predict these variables using time series forecasting and ML techniques to address the ongoing global health crisis.
ARIMA is a statistical model for forecasting univariate time series with autocorrelation and non-stationarity, integrating autoregressive (AR), differencing (I), and moving average (MA) components [13,19]. The AR component links current observations to past values, while the MA component models dependencies on past forecast errors. Differencing achieves stationarity. The ARIMA($p,d,q$) model combines these components, where $p$ is the autoregressive order, $d$ the degree of differencing, and $q$ the moving average order.
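For reference, the ARIMA($p,d,q$) model can be written in its standard textbook form (a general formulation, not an equation reproduced from the source):

```latex
% ARIMA(p, d, q): the d-times differenced series follows an ARMA(p, q) process
\Delta^{d} y_{t} = c
  + \sum_{i=1}^{p} \phi_{i}\,\Delta^{d} y_{t-i}   % AR terms on past values
  + \sum_{j=1}^{q} \theta_{j}\,\varepsilon_{t-j}  % MA terms on past errors
  + \varepsilon_{t},
\qquad \varepsilon_{t} \sim \mathrm{WN}(0, \sigma^{2})
```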
SVR is an extension of the SVM framework for regression tasks. It aims to find a function $f(\mathbf{x}) = \langle \mathbf{w}, \mathbf{x} \rangle + b$ that deviates from the observed targets by at most $\varepsilon$ while remaining as flat as possible, i.e., keeping $\Vert \mathbf{w} \Vert^{2}$ small.
To allow for infeasible constraints due to noise or outliers, slack variables $\xi_i$ and $\xi_i^{*}$ are introduced, yielding the optimization problem
$$\min_{\mathbf{w}, b, \xi, \xi^{*}} \; \frac{1}{2}\Vert \mathbf{w} \Vert^{2} + C \sum_{i=1}^{n} (\xi_i + \xi_i^{*})$$
subject to $y_i - \langle \mathbf{w}, \mathbf{x}_i \rangle - b \le \varepsilon + \xi_i$, $\langle \mathbf{w}, \mathbf{x}_i \rangle + b - y_i \le \varepsilon + \xi_i^{*}$, and $\xi_i, \xi_i^{*} \ge 0$.
Several commonly used kernels in SVR include the following:
- Linear kernel: $K(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i^{\top} \mathbf{x}_j$
- Polynomial kernel: $K(\mathbf{x}_i, \mathbf{x}_j) = (\gamma\, \mathbf{x}_i^{\top} \mathbf{x}_j + r)^{d}$
- Radial basis function (RBF) kernel: $K(\mathbf{x}_i, \mathbf{x}_j) = \exp(-\gamma \Vert \mathbf{x}_i - \mathbf{x}_j \Vert^{2})$
Regularization through the parameter $C$ controls the trade-off between the flatness of $f$ and the degree to which deviations larger than $\varepsilon$ are tolerated: larger values of $C$ penalize errors more heavily, while smaller values produce smoother functions.
RF is an ensemble learning algorithm that builds multiple decision trees and aggregates their outputs to perform regression. It was introduced by Breiman [5] and is particularly effective in handling high-dimensional data, nonlinear relationships, and reducing overfitting.
In regression settings, the goal is to predict a continuous target variable $y$ from an input vector $\mathbf{x}$. An RF with $B$ trees produces the ensemble prediction $\hat{y} = \frac{1}{B} \sum_{b=1}^{B} T_b(\mathbf{x})$, the average of the individual tree outputs $T_b(\mathbf{x})$.
Each tree is trained on a bootstrap sample, which is created by randomly sampling (with replacement) from the original training dataset. This introduces variability among the trees. Additionally, during the tree-building process, RF introduces further randomness by selecting a random subset of features at each split rather than considering all features. This decorrelates the trees and enhances generalization.
At each node of a decision tree, the algorithm selects the best split point among the randomly chosen features based on a splitting criterion that minimizes prediction error. For regression, the most commonly used criterion is the mean-squared error between predicted and actual values in the resulting child nodes. This local optimization helps guide the recursive partitioning of the input space.
Unlike a single decision tree, which may overfit the training data, RF reduces the variance of predictions by averaging across many trees. The ensemble effect ensures that individual overfitting errors are averaged out, leading to more robust and accurate predictions.
RNNs are a class of neural architectures tailored for sequential data modeling. They maintain a hidden state $\mathbf{h}_t = \phi(\mathbf{W}_h \mathbf{h}_{t-1} + \mathbf{W}_x \mathbf{x}_t + \mathbf{b})$ that is updated at each time step, allowing information from earlier inputs to influence later predictions.
LSTM is a specialized type of RNN designed to capture long-range dependencies in sequential data by using gates to control the flow of information. Each LSTM cell contains three primary gates: the forget gate, input gate, and output gate, which regulate how information is retained, updated, and outputted at each time step [6].
The forget gate $\mathbf{f}_t = \sigma(\mathbf{W}_f [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_f)$ determines which information from the previous cell state is discarded.
The input gate $\mathbf{i}_t = \sigma(\mathbf{W}_i [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_i)$ controls how much new information enters the cell.
The cell candidate $\tilde{\mathbf{c}}_t = \tanh(\mathbf{W}_c [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_c)$ proposes new content, and the cell state is updated as $\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t$.
The output gate $\mathbf{o}_t = \sigma(\mathbf{W}_o [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_o)$ regulates what is exposed from the cell state.
Finally, the hidden state is computed as $\mathbf{h}_t = \mathbf{o}_t \odot \tanh(\mathbf{c}_t)$.
Figure 1 presents the LSTM network, which incorporates memory cells to handle long-term dependencies effectively.

Architecture of the LSTM network. Source: From ref. [39].
The GRU is a simpler version of LSTM with fewer parameters. It combines the forget and input gates of LSTM into a single update gate. The GRU has two main gates: the update gate and the reset gate [39].
The update gate $\mathbf{z}_t = \sigma(\mathbf{W}_z [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_z)$ determines how much of the previous hidden state is carried forward.
The reset gate $\mathbf{r}_t = \sigma(\mathbf{W}_r [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_r)$ controls how much past information is used when forming the new candidate.
The candidate hidden state is $\tilde{\mathbf{h}}_t = \tanh(\mathbf{W}_h [\mathbf{r}_t \odot \mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_h)$.
Finally, the hidden state is updated as $\mathbf{h}_t = (1 - \mathbf{z}_t) \odot \mathbf{h}_{t-1} + \mathbf{z}_t \odot \tilde{\mathbf{h}}_t$.
The structure of the GRU is illustrated in Figure 2, where gating mechanisms help regulate information flow and improve efficiency.

Architecture of the GRU. Source: From ref. [39].
The predictive performance of the models is evaluated using four standard metrics: RMSE, MAE, MAPE, and $R^2$, defined as
$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{t=1}^{n} (y_t - \hat{y}_t)^{2}}, \qquad \mathrm{MAE} = \frac{1}{n} \sum_{t=1}^{n} |y_t - \hat{y}_t|,$$
$$\mathrm{MAPE} = \frac{100}{n} \sum_{t=1}^{n} \left| \frac{y_t - \hat{y}_t}{y_t} \right|, \qquad R^2 = 1 - \frac{\sum_{t} (y_t - \hat{y}_t)^{2}}{\sum_{t} (y_t - \bar{y})^{2}},$$
where $y_t$ and $\hat{y}_t$ denote the observed and predicted values and $\bar{y}$ is the sample mean. Lower RMSE, MAE, and MAPE and higher $R^2$ indicate better predictive performance.
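These four metrics can be computed directly with scikit-learn and NumPy. A minimal sketch (function and variable names are illustrative, and the MAPE line assumes no zero actual values):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

def evaluate_forecast(y_true, y_pred):
    """Return the four evaluation metrics used in this study."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    mae = mean_absolute_error(y_true, y_pred)
    mape = 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))  # assumes y_true != 0
    r2 = r2_score(y_true, y_pred)
    return {"RMSE": rmse, "MAE": mae, "MAPE": mape, "R2": r2}
```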
To statistically evaluate whether the forecasting performance of two competing models differs significantly, the DM test is applied [8]. The test is based on the loss differential series $d_t = L(e_{1t}) - L(e_{2t})$, where $e_{1t}$ and $e_{2t}$ are the forecast errors of the two models and $L(\cdot)$ is a loss function.
The DM test evaluates the hypotheses $H_0: \mathbb{E}[d_t] = 0$ (equal predictive accuracy) against $H_1: \mathbb{E}[d_t] \neq 0$.
The test statistic is given by
$$\mathrm{DM} = \frac{\bar{d}}{\sqrt{\hat{\sigma}_{\bar{d}}^{2} / n}},$$
where $\bar{d}$ is the sample mean of the loss differentials, $\hat{\sigma}_{\bar{d}}^{2}$ is a consistent estimate of the long-run variance of $d_t$, and $n$ is the number of forecasts.
Under the null hypothesis, the DM statistic asymptotically follows a standard normal distribution: $\mathrm{DM} \xrightarrow{d} N(0, 1)$.
If the absolute value of the DM statistic exceeds the critical value from the standard normal distribution, the null hypothesis is rejected, indicating a statistically significant difference in predictive performance between the two models. In this study, the DM test is applied using MSE, MAE, and MAPE as loss functions to ensure comprehensive evaluation.
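A minimal sketch of the DM test for one-step forecasts under the three loss functions used in this study. It is a simplification that uses only the lag-0 variance of the loss differential, which suffices for one-step-ahead comparisons; the exact implementation used by the authors is not given in the source:

```python
import numpy as np
from scipy import stats

def diebold_mariano(y_true, pred1, pred2, loss="mse"):
    """DM test for equal one-step predictive accuracy of two forecasts."""
    y_true = np.asarray(y_true, float)
    e1 = y_true - np.asarray(pred1, float)
    e2 = y_true - np.asarray(pred2, float)
    if loss == "mse":
        d = e1**2 - e2**2
    elif loss == "mae":
        d = np.abs(e1) - np.abs(e2)
    elif loss == "mape":
        d = np.abs(e1 / y_true) - np.abs(e2 / y_true)  # assumes y_true != 0
    else:
        raise ValueError("unsupported loss")
    n = d.size
    dm = d.mean() / np.sqrt(d.var(ddof=1) / n)  # lag-0 long-run variance only
    p_value = 2 * (1 - stats.norm.cdf(abs(dm)))
    return dm, p_value
```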
The experimental setup follows an 80:20 data split, where 80% of the time series is used for training and 20% for testing. We evaluate the forecasting performance of six models: ARIMA, SVR, RF, RNN, GRU, and LSTM. These models are applied to four key variables, with all learning carried out under a consistent framework.
To prepare the time series data for supervised learning with SVR, RF, and the recurrent networks, we apply a sliding window approach with a 30-day window. Given a univariate sequence $\{y_1, y_2, \dots, y_T\}$, each input vector consists of the 30 preceding observations, $X_t = (y_{t-30}, \dots, y_{t-1})$, with the corresponding target $y_t$; a sketch of this construction is given below.
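A minimal sketch of the windowing step with NumPy; the 30-day window length follows the text, while the function name is illustrative:

```python
import numpy as np

def make_windows(series, window=30):
    """Turn a 1-D series into (X, y) pairs for supervised learning.

    Each row of X holds `window` consecutive observations, and the
    corresponding entry of y is the value immediately following them.
    """
    series = np.asarray(series, dtype=float)
    X = np.stack([series[i : i + window] for i in range(len(series) - window)])
    y = series[window:]
    return X, y

# Example: X, y = make_windows(total_cases, window=30)
```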
For the DL models (RNN, GRU, LSTM), feature scaling is performed using Min-Max normalization to stabilize and accelerate the training process. The normalization formula is
$$y^{\prime} = \frac{y - y_{\min}}{y_{\max} - y_{\min}},$$
which maps each value into the range $[0, 1]$.
After prediction, the inverse transformation is applied to recover the original scale:
$$y = y^{\prime} (y_{\max} - y_{\min}) + y_{\min}.$$
Rescaling ensures that input features are within a uniform range, preventing dominance by features with larger numerical values. This enhances the convergence behavior of DL algorithms and contributes to better generalization performance.
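In practice this can be done with scikit-learn's MinMaxScaler. The sketch below fits the scaler on the training portion only, a standard precaution against test-set leakage that the text does not spell out; the data values are illustrative:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train = np.array([146.05, 3737.57, 8218.49])   # illustrative values
test = np.array([12979.51, 21393.05])

scaler = MinMaxScaler(feature_range=(0, 1))
train_scaled = scaler.fit_transform(train.reshape(-1, 1))  # fit on train only
test_scaled = scaler.transform(test.reshape(-1, 1))        # reuse train min/max

# After predicting on the scaled data, invert the transform:
preds_original = scaler.inverse_transform(test_scaled)     # recovers `test`
```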
Hyperparameter tuning was carried out for RF, ARIMA, and SVR using grid search or automated routines, while DL models (RNN, LSTM, GRU) employed fixed configurations.
For ARIMA, the model parameters were automatically selected using the auto_arima routine from the pmdarima package in Python, resulting in an optimal configuration of ARIMA(4,1,0) for all datasets. The differencing order $d = 1$ is consistent with the ADF test results, which indicate that the series are non-stationary in levels.
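A minimal sketch of the order selection with pmdarima's auto_arima; the series here is an illustrative stand-in for a training split, and the keyword arguments shown are common defaults rather than settings reported in the source:

```python
import numpy as np
import pmdarima as pm

# Illustrative stand-in for the training portion of one series (e.g., total cases).
rng = np.random.default_rng(0)
train = np.cumsum(rng.normal(50.0, 10.0, size=399))

# Stepwise search over (p, d, q) orders, minimizing an information criterion;
# the study reports ARIMA(4, 1, 0) as the selected configuration.
model = pm.auto_arima(train, seasonal=False, stepwise=True, suppress_warnings=True)
print(model.order)                       # e.g., (4, 1, 0)
forecast = model.predict(n_periods=100)  # one value per test-set day
```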
For SVR, hyperparameter tuning was performed using grid search, which explored the following parameter space (see the sketch after this list):
- Regularization parameter: $C \in \{1, 10, 100\}$
- Kernel coefficient: $\gamma \in \{10^{-6}, 10^{-5}, 10^{-4}\}$
- Kernel: RBF, linear, and polynomial.
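A minimal sketch of this search with scikit-learn's GridSearchCV. The time-ordered cross-validation folds and the toy windowed data are assumptions, since the source does not name its CV scheme:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.svm import SVR

# Toy windowed data standing in for make_windows(series, window=30).
rng = np.random.default_rng(0)
series = np.cumsum(rng.normal(50.0, 10.0, size=200))
X_train = np.stack([series[i : i + 30] for i in range(len(series) - 30)])
y_train = series[30:]

param_grid = {
    "C": [1, 10, 100],
    "gamma": [1e-6, 1e-5, 1e-4],
    "kernel": ["rbf", "linear", "poly"],
}
search = GridSearchCV(
    SVR(),
    param_grid,
    cv=TimeSeriesSplit(n_splits=5),          # assumption: time-ordered folds
    scoring="neg_root_mean_squared_error",
)
search.fit(X_train, y_train)
print(search.best_params_)
```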
For the RF model, hyperparameter tuning was performed using a grid search over a comprehensive parameter space to optimize predictive performance (see the sketch after this list). The following hyperparameters were explored:
- Number of trees (n_estimators): {50, 100, 200}
- Maximum tree depth (max_depth): {None, 10, 20, 30}
- Minimum samples required to split a node (min_samples_split): {2, 5, 10}
- Minimum samples required at a leaf node (min_samples_leaf): {1, 2, 4}
- Maximum number of features considered for splits (max_features): {auto, sqrt}
- Splitting criterion (criterion): {mse, mae}
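A minimal sketch of the equivalent search with scikit-learn, reusing the windowed X_train and y_train from the SVR sketch above. Note that recent scikit-learn releases spell the criteria "squared_error"/"absolute_error" (formerly "mse"/"mae") and have removed "auto" from max_features, so the grid below uses the modern equivalents:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "max_features": [1.0, "sqrt"],                     # 1.0 replaces the removed "auto"
    "criterion": ["squared_error", "absolute_error"],  # formerly "mse" / "mae"
}
search = GridSearchCV(
    RandomForestRegressor(random_state=42),            # assumption: fixed seed
    param_grid,
    cv=TimeSeriesSplit(n_splits=5),                    # assumption, as for SVR
    scoring="neg_root_mean_squared_error",
    n_jobs=-1,
)
search.fit(X_train, y_train)  # windowed data, as in the SVR sketch
best_rf = search.best_estimator_
```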
For the DL models, we used a fixed architecture with three layers and ReLU activation. The optimizer for all DL models was Adam with a learning rate of 0.001. Dropout was applied with a rate of 0.2 to prevent overfitting, and early stopping was used to halt training if the validation loss did not improve. The models were trained for 100 epochs with a batch size of 16.
The parameter settings for the DL models are summarized in Table 1. These settings ensure consistency across all models, enabling a fair comparison of their performance.
Parameter settings for RNN, LSTM, and GRU models
| Parameter | RNN | LSTM | GRU |
|---|---|---|---|
| Number of layers | 3 | 3 | 3 |
| Activation | ReLU | ReLU | ReLU |
| Loss function | MSE | MSE | MSE |
| Optimizer | Adam | Adam | Adam |
| Learning rate | 0.001 | 0.001 | 0.001 |
| Dropout rate | 0.2 | 0.2 | 0.2 |
| Epochs | 100 | 100 | 100 |
| Batch size | 16 | 16 | 16 |
| Units per layer | 100, 50, 25 | 100, 50, 25 | 100, 50, 25 |
| Early stopping | Yes (monitor = val_loss, patience = 10) | Yes (monitor = val_loss, patience = 10) | Yes (monitor = val_loss, patience = 10) |
Explanation of parameters for RNN, LSTM, and GRU models:
-
Number of layers: All models have three layers of recurrent units (RNN, LSTM, or GRU). More layers allow the models to capture more complex patterns in sequential data.
-
Activation function: The activation function used in all models is ReLU (rectified linear unit), which helps mitigate the vanishing gradient problem and accelerates convergence.
-
Loss function: The MSE is used as the loss function for all models. It measures the average of the squares of the errors, making it sensitive to larger errors and ensuring that the model minimizes them.
-
Optimizer: Adam (Adaptive Moment Estimation) is used for optimization in all models. It adjusts the learning rate during training to improve convergence speed and accuracy.
-
Learning rate: Set to 0.001 for all models; this determines the step size during model training. A small learning rate helps with more stable training.
-
Dropout rate: Dropout of 0.2 is used to prevent overfitting. It randomly disables 20% of neurons in the network during training, forcing the model to learn more robust representations.
-
Epochs: All models are trained for 100 epochs to ensure enough iterations for convergence.
-
Batch size: A batch size of 16 is used, meaning 16 samples are processed at once during each training iteration.
-
Units per layer: The number of units in each layer is set to 100 for the first layer, 50 for the second, and 25 for the third layer, with the goal of progressively reducing the complexity of the model.
-
Early stopping: This mechanism halts training when the validation loss stops improving, preventing overfitting and unnecessary training.
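A minimal Keras sketch of one such network (the LSTM variant; replacing layers.LSTM with layers.SimpleRNN or layers.GRU yields the other two models). Layer sizes, activation, dropout, optimizer, loss, and early stopping follow Table 1, while the final Dense(1) output head and the validation split are assumptions the table does not state:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_lstm(window=30):
    """Stacked LSTM with 100/50/25 units, ReLU, and 0.2 dropout, per Table 1."""
    model = keras.Sequential([
        layers.Input(shape=(window, 1)),
        layers.LSTM(100, activation="relu", return_sequences=True),
        layers.Dropout(0.2),
        layers.LSTM(50, activation="relu", return_sequences=True),
        layers.Dropout(0.2),
        layers.LSTM(25, activation="relu"),
        layers.Dropout(0.2),
        layers.Dense(1),  # assumption: single-output head producing the forecast
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001), loss="mse")
    return model

early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=10)
# model = build_lstm()
# model.fit(X_train, y_train, validation_split=0.1,  # assumption: validation share
#           epochs=100, batch_size=16, callbacks=[early_stop])
```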
The Python packages used in this study include NumPy for numerical operations and array handling, pandas for data manipulation and analysis, Seaborn for data visualization, Matplotlib for creating static, animated, and interactive visualizations, scikit-learn for ML models including SVR, RF, and performance metrics, statsmodels for statistical models and time series analysis, pmdarima for automatic ARIMA model selection, and Keras for building and training DL models.
Figure 3 presents the time series plots of four key variables: total cases, severe cases, critical cases, and total deaths. The patterns indicate an increasing trend over time with noticeable fluctuations. The presence of volatility suggests the need for further statistical analysis, such as stationarity tests and autocorrelation diagnostics.

Time series plots of total cases, severe cases, critical cases, and total deaths. Source: Created by the authors.
Table 2 presents the summary statistics of the four key variables. The means and standard deviations highlight high variability in the dataset. The skewness values indicate a slight positive skew, suggesting that the distributions are slightly right-tailed. The negative kurtosis values suggest that the distributions are flatter than a normal distribution, indicating fewer extreme events.

Summary statistics

| | Count | Mean | Std Dev | Min | 25% | 50% | 75% | Max | Kurtosis | Skewness |
|---|---|---|---|---|---|---|---|---|---|---|
| Total cases | 499 | 8841.58 | 5911.12 | 146.05 | 3737.57 | 8218.49 | 12979.51 | 21393.05 | – | 0.4086 |
| Severe cases | 499 | 4210.27 | 2814.82 | 69.55 | 1779.80 | 3913.57 | 6180.72 | 10187.17 | – | 0.4086 |
| Critical cases | 499 | 1472.06 | 983.35 | 25.06 | 616.25 | 1359.01 | 2158.80 | 3570.59 | – | 0.4116 |
| Total deaths | 499 | 588.83 | 393.34 | 10.03 | 246.50 | 543.60 | 863.52 | 1428.24 | – | 0.4116 |
Table 3 shows the results of the augmented Dickey-Fuller (ADF) test for stationarity. Since the p-values are greater than 0.05, we fail to reject the null hypothesis of a unit root, indicating that the time series are non-stationary. This suggests the need for differencing or transformation before applying forecasting models.

ADF test results for stationarity

| Variable | ADF statistic | p-value |
|---|---|---|
| Total cases | – | 0.4360 |
| Severe cases | – | 0.4360 |
| Critical cases | – | 0.4327 |
| Total deaths | – | 0.4328 |
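The stationarity check itself is a one-liner with statsmodels. A minimal sketch on an illustrative random-walk series (the Kenyan series would be passed in its place); the null hypothesis is that the series has a unit root, i.e., is non-stationary:

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(0)
series = np.cumsum(rng.normal(50.0, 10.0, size=499))  # illustrative non-stationary series

adf_stat, p_value, *_ = adfuller(series)
print(f"ADF statistic = {adf_stat:.4f}, p-value = {p_value:.4f}")
# p-value > 0.05 -> fail to reject the unit-root null -> difference the series
```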
Figure 4 displays the autocorrelation function (ACF) and partial autocorrelation function (PACF) plots for the four variables. The slow decay in the ACF plots suggests the presence of long memory in the time series. Long memory, also known as long-range dependence, implies that past observations significantly influence future values over a prolonged period.

ACF and PACF for total cases, severe cases, critical cases, and total deaths. Source: Created by the authors.
Table 4 summarizes the forecasting performance of the six models – ARIMA, SVR, RNN, GRU, LSTM, and RF – across the four datasets: total cases, critical cases, severe cases, and total deaths, with metrics reported to four decimal places. RF consistently outperforms all models across all datasets, achieving the lowest RMSE, MAE, and MAPE and the highest $R^2$ values, nearing 1.0000. For total cases, RF yields an RMSE of 93.4117, MAE of 35.9370, MAPE of 0.2668, and $R^2$ of 0.9995; for critical cases, an RMSE of 17.5342, MAE of 7.3318, MAPE of 0.3330, and $R^2$ of 0.9994; for severe cases, an RMSE of 44.4818, MAE of 17.1128, MAPE of 0.2668, and $R^2$ of 0.9995; and for total deaths, an RMSE of 7.0137, MAE of 2.9327, MAPE of 0.3330, and $R^2$ of 0.9994. These results highlight RF's superior accuracy and robustness in capturing nonlinear patterns in COVID-19 data.
Performance metrics (RMSE, MAE, MAPE, and $R^2$) for the six models across the four datasets

| Dataset | Model | RMSE | MAE | MAPE (%) | $R^2$ |
|---|---|---|---|---|---|
| Total cases | ARIMA | 13870.6770 | 10916.6357 | 82.0056 | – |
| | SVR | 5589.6811 | 3874.5758 | 26.4893 | – |
| | RNN | 365.3872 | 289.7108 | 2.3229 | 0.9929 |
| | LSTM | 338.4482 | 281.7235 | 2.4740 | 0.9939 |
| | GRU | 215.3505 | 177.0998 | 1.5449 | 0.9975 |
| | **RF** | **93.4117** | **35.9370** | **0.2668** | **0.9995** |
| Critical cases | ARIMA | 2055.3009 | 1599.7675 | 70.9180 | – |
| | SVR | 467.2761 | 216.0696 | 6.8270 | 0.5970 |
| | RNN | 62.5152 | 54.0835 | 2.8693 | 0.9928 |
| | LSTM | 185.1695 | 144.1515 | 6.6211 | 0.9367 |
| | GRU | 42.5913 | 35.1251 | 1.8999 | 0.9967 |
| | **RF** | **17.5342** | **7.3318** | **0.3330** | **0.9994** |
| Severe cases | ARIMA | 6593.2169 | 5188.9595 | 81.8556 | – |
| | SVR | 2223.5193 | 1341.9956 | 17.5188 | – |
| | RNN | 140.1730 | 110.0014 | 2.3796 | 0.9954 |
| | LSTM | 281.8612 | 221.9638 | 3.9355 | 0.9814 |
| | GRU | 140.4860 | 113.2804 | 1.9507 | 0.9954 |
| | **RF** | **44.4818** | **17.1128** | **0.2668** | **0.9995** |
| Total deaths | ARIMA | 822.0758 | 639.8712 | 70.9140 | – |
| | SVR | 89.1646 | 34.3085 | 2.6381 | 0.9083 |
| | RNN | 23.3445 | 17.9225 | 2.0940 | 0.9937 |
| | LSTM | 40.4583 | 31.6118 | 4.0341 | 0.9811 |
| | GRU | 21.3337 | 18.6039 | 2.5657 | 0.9947 |
| | **RF** | **7.0137** | **2.9327** | **0.3330** | **0.9994** |

All values are reported to four decimal places. Bold rows indicate the RF model, which consistently outperforms the other models.
ARIMA shows the weakest performance, with high error metrics and negative $R^2$ values, indicating a fit worse than a constant-mean baseline. For total cases, ARIMA's RMSE is 13870.6770, MAE is 10916.6357, MAPE is 82.0056, and $R^2$ is negative.
RF's consistent performance across datasets underscores its suitability for epidemic forecasting in this study, where accuracy and efficiency are vital. Its low errors and near-perfect $R^2$ values indicate strong generalization, ideal for public health applications. The limitations of ARIMA and SVR on complex, nonlinear data, and the moderate performance of RNN, GRU, and LSTM, suggest the recurrent models are better suited to long-term temporal patterns. These findings support the use of ensemble methods like RF, with potential for future hybrid models combining RF and DL to improve accuracy.
Figure 5 illustrates the comparison between predicted and actual values for all four datasets. The close alignment of RF's predictions with the actual values, with minimal deviations compared to the other models, further demonstrates its consistent predictive accuracy and strong generalization in resource-constrained settings like Kenya.

Prediction vs actual values for total cases, critical cases, severe cases, and total deaths datasets. Source: Created by the authors.
Figure 6 presents a bar plot comparing the forecasting performance metrics of six models.

Bar plot of performance metrics (RMSE, MAE, MAPE, and
Figures 7, 8, 9, and 10 confirm that RF predictions exhibit a strong linear correlation with actual values across all datasets. The alignment of the majority of points along the diagonal in each scatterplot suggests that RF consistently provides accurate predictions with minimal deviations.

Scatterplot of observed vs predicted values for the total cases dataset. The RF model demonstrates a strong linear relationship, indicating accurate predictions. Source: Created by the authors.

Scatterplot of observed vs predicted values for the critical cases dataset. The RF model demonstrates a strong linear relationship, indicating accurate predictions. Source: Created by the authors.

Scatterplot of observed vs predicted values for the severe cases dataset. The RF model demonstrates a strong linear relationship, indicating accurate predictions. Source: Created by the authors.

Scatterplot of observed vs predicted values for the total deaths dataset. The RF model demonstrates a strong linear relationship, indicating accurate predictions. Source: Created by the authors.
Figure 11 illustrates the distribution of forecasting errors for the total cases, critical cases, severe cases, and total deaths datasets. The plot compares the error distributions across the six models, demonstrating that RF consistently exhibits the lowest median error, minimal variance, and fewer outliers across all datasets. In contrast, models such as ARIMA and SVR show higher error variance and more outliers, while RNN, GRU, and LSTM display moderate performance, suggesting they are better suited to long-term temporal patterns. These findings highlight RF's superior balance of accuracy and error minimization.

Boxplot of forecasting error distributions for six models – ARIMA, SVR, RNN, LSTM, GRU, and RF – across four COVID-19 datasets. Each model’s errors are grouped by dataset, with boxes showing the interquartile range, median, and outliers of the forecasting errors. The RF model consistently exhibits the lowest median error, minimal variance, and fewer outliers across all datasets, highlighting its superior predictive accuracy and stability. Source: Created by the authors.
A heatmap, as shown in Figure 12, visually represents model performance through color gradients, clearly highlighting differences in RMSE, MAE, and MAPE across the total cases, critical cases, severe cases, and total deaths datasets. Relative to the RF baseline, every competing model shows higher errors, with ARIMA exhibiting the largest gap.

Heatmap of model performance metrics (RMSE, MAE, MAPE) across total cases, critical cases, severe cases, and total deaths datasets. Relative to the RF baseline, every competing model shows higher errors, with ARIMA exhibiting the largest gap. Source: Created by the authors.
As a robustness check, the DM test, as shown in Table 5, assesses the statistical significance of differences in forecasting errors between the best-performing model, RF, and the benchmark models (ARIMA, SVR, RNN, LSTM, GRU) across the total cases, critical cases, severe cases, and total deaths datasets, using MSE, MAE, and MAPE as loss functions. All comparisons are significant at the 1% level ($p < 0.01$), confirming that RF's forecasting errors differ significantly from those of every benchmark.
DM test statistics comparing RF to benchmark models across datasets and loss functions

| Dataset | Benchmark model | MSE | MAE | MAPE |
|---|---|---|---|---|
| Total cases | ARIMA | – | – | – |
| | SVR | – | – | – |
| | RNN | – | – | – |
| | LSTM | – | – | – |
| | GRU | – | – | – |
| Critical cases | ARIMA | – | – | – |
| | SVR | – | – | – |
| | RNN | – | – | – |
| | LSTM | – | – | – |
| | GRU | – | – | – |
| Severe cases | ARIMA | – | – | – |
| | SVR | – | – | – |
| | RNN | – | – | – |
| | LSTM | – | – | – |
| | GRU | – | – | – |
| Total deaths | ARIMA | – | – | – |
| | SVR | – | – | – |
| | RNN | – | – | – |
| | LSTM | – | – | – |
| | GRU | – | – | – |

Note: All DM statistics are significant at the 1% level ($p < 0.01$).
The primary objective of this study was to evaluate the forecasting performance of statistical, ML, and DL models for predicting COVID-19 trends in Kenya, focusing on total cases, critical cases, severe cases, and total deaths. By comparing models such as ARIMA, SVR, RF, RNN, LSTM, and GRU, the study aimed to identify the most effective approach for epidemic forecasting in a resource-constrained setting. A robust evaluation framework, including multiple error metrics and the DM test, was employed to assess predictive accuracy and statistical significance of differences in forecasting errors. This comprehensive analysis sought to provide actionable insights for public health decision-making by determining which models best capture the complex dynamics of epidemic data in Kenya.
The key finding of this study is that ensemble ML methods, particularly RF, offer superior predictive accuracy and computational efficiency for COVID-19 forecasting in Kenya, making them highly suitable for resource-limited environments. While DL models such as GRU and LSTM show promise in capturing temporal dependencies, their performance is generally outshone by RF, which consistently excels across all datasets. In contrast, traditional statistical models like ARIMA struggle with the nonlinear patterns inherent in epidemic data, highlighting the advantage of ML approaches in such contexts. The DM test reinforces these findings by confirming significant differences in forecasting performance, with RF typically outperforming benchmarks, except in specific cases where other models show marginal advantages.
These findings underscore the potential of ML, particularly RF, to enhance epidemic preparedness, where efficient resource allocation is critical. The study advocates for the adoption of ensemble methods in public health forecasting while suggesting that future research explore hybrid models combining statistical and ML techniques to further improve accuracy. By leveraging such models, policymakers can make informed decisions to mitigate the impact of infectious diseases.
This study, while comprehensive in evaluating six models for COVID-19 forecasting in Kenya, has several limitations. The analysis relied on a single 80:20 train-test split, focusing on one-step-ahead predictions, which may not fully capture the models' performance in multi-step forecasting scenarios. Time-aware cross-validation, such as walk-forward validation, could provide a more robust assessment but was not implemented due to computational constraints. Additionally, the study was limited to six models, excluding simpler linear models and other advanced techniques, such as hybrid approaches or transformer-based architectures, which might offer complementary insights. The absence of external features, such as new COVID-19 variants, mobility patterns, or socioeconomic indicators, limits the models' ability to account for real-world complexities. To address these limitations, future research should incorporate time-aware cross-validation to enhance model robustness and explore multi-step forecasting to better simulate real-world epidemic scenarios. Expanding the model space to include linear models, hybrid statistical-ML approaches, or transformer-based architectures, as suggested in recent studies, could improve predictive power. Incorporating external features, such as policy changes, mobility data, or socioeconomic factors, and real-time data streams, would provide a more holistic view of epidemic dynamics.