Keywords: long short-term memory; gated recurrent unit; recurrent neural network; autoregressive integrated moving average; support vector regression; random forest; Akaike information criterion; Bayesian information criterion; partial autocorrelation function; autocorrelation function; root mean square error; mean absolute error; mean absolute percentage error; coefficient of determination; machine learning; deep learning; coronavirus disease 2019
The coronavirus disease 2019 (COVID-19) pandemic, caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) beginning in late 2019, precipitated an unprecedented global health and economic crisis, with profound effects on diverse regions, including Africa [7,20,22,43]. In Kenya, the virus challenged a healthcare system already constrained by limited infrastructure, personnel, and financial resources, exacerbating vulnerabilities during multiple infection waves. The Kenyan government responded with measures such as lockdowns, testing campaigns, and vaccination drives, but their success hinged on accurate forecasting to inform timely resource allocation and policy interventions [17,30,31]. Accurate predictions of epidemic trends, including case numbers and mortality rates, are essential for optimizing healthcare preparedness, prioritizing interventions, and mitigating socioeconomic impacts [15,23,34,35,38]. Beyond Kenya, the pandemic underscored the global need for robust forecasting models to manage infectious diseases, particularly in low-resource settings where data and computational limitations pose significant challenges.
Epidemic forecasting leverages diverse modeling approaches to capture the complex dynamics of infectious diseases. Statistical models, such as the autoregressive integrated moving average (ARIMA), are valued for their simplicity and ability to model linear trends and seasonality in time series data [4,18,21]. However, their reliance on linear assumptions limits their effectiveness in capturing the nonlinear, volatile patterns typical of pandemics. Machine learning (ML) models, such as support vector regression (SVR) and random forest (RF), address these limitations by modeling complex, nonlinear relationships, offering improved predictive accuracy [12,14,25,29,32,33,36,44]. Deep learning (DL) models, including recurrent neural networks (RNN), long short-term memory (LSTM), and gated recurrent units (GRU), excel in capturing temporal dependencies in sequential data, making them well suited for epidemic time series [1,2,42]. These advanced models have informed global health strategies, particularly in resource-limited settings where efficient forecasting can optimize scarce resources [24,28].
A study by Arora et al. [2] employed DL models, specifically RNN-based LSTM variants (deep LSTM, convolutional LSTM, and bi-directional LSTM), to predict COVID-19-positive cases across 32 Indian states and union territories, selecting the LSTM variant with the lowest error for daily and weekly forecasts. Sinha et al. [40] forecasted COVID-19-confirmed cases in the USA, India, Brazil, Russia, and France, comparing artificial neural network (ANN) and RNN-based LSTM models, using mean squared error for validation. Their main finding was that LSTM outperformed ANN, indicating higher accuracy for epidemic predictions; these results suggest LSTM's potential for informing timely public health interventions in highly affected countries. Ghafouri-Fard et al. [10] used ML and DL methods, including the adaptive neuro-fuzzy inference system (ANFIS), LSTM, RNN, multilayer perceptron (MLP), and ARIMA, to predict COVID-19 case trends, comparing model performance using root mean square error (RMSE), mean absolute error (MAE), and mean absolute percentage error (MAPE), among other metrics.
Rustam et al. [37] applied ML models – linear regression (LR), least absolute shrinkage and selection operator (LASSO), support vector machines (SVM), and exponential smoothing (ES) – to forecast COVID-19 cases, deaths, and recoveries over 10 days. ES outperformed the other models, followed by LR and LASSO, while SVM showed poor performance across all prediction scenarios. The findings highlight the potential of ML, particularly ES, for effective COVID-19 forecasting to support decision-making.
This study contributes to epidemic forecasting by evaluating a range of forecasting models to predict COVID-19 trends in Kenya, focusing on total cases, critical cases, severe cases, and total deaths, using data from April 2020 to August 2021, to enhance epidemic preparedness in resource-constrained environments. The study conducts a comparative analysis of six models – ARIMA, SVR, RF, RNN, LSTM, and GRU – to forecast COVID-19 trends in Kenya, evaluating their performance across multiple dimensions to identify the most effective approach for public health applications.
The methodology employs an evaluation framework to compare model performance. Data from April 2020 to August 2021 are split into an 80:20 train-test ratio to ensure robust validation. Four evaluation metrics – RMSE, MAE, MAPE, and the coefficient of determination ($R^2$) – are used to assess predictive accuracy. The key contributions of this study are as follows:
- Comparative evaluation of six models – ARIMA, SVR, RF, RNN, LSTM, and GRU – for COVID-19 forecasting in Kenya, using RMSE, MAE, MAPE, and $R^2$ to identify RF as the most effective model for resource-constrained settings.
- Application of the Diebold–Mariano (DM) test with MSE, MAE, and MAPE as loss functions to confirm statistically significant differences in forecasting performance, highlighting RF's superior accuracy.
- Demonstration of RF's effectiveness in predicting total cases, critical cases, severe cases, and total deaths, offering practical insights for public health forecasting.
This article is organized as follows: Section 2 details the dataset and methodology, including model specifications, evaluation metrics, and the DM test. Section 3 presents the results, including descriptive statistics, forecasting outcomes, and robustness checks, supported by visualizations such as time series plots and heatmaps. Section 4 discusses the findings, emphasizing model strengths and limitations. Section 5 concludes with key insights and implications for epidemic preparedness in Kenya and similar contexts. Section 6 addresses study limitations and suggests future research directions, such as incorporating external features and cross-validation.
The dataset, obtained from the Ministry of Health website in Kenya, spans from April 15, 2020, to August 26, 2021, and includes four columns: total cases, severe cases, critical cases, and total deaths [27]. It provides daily records of these metrics, capturing the progression and impact of the COVID-19 pandemic. The analysis aims to model and predict these variables using time series forecasting and ML techniques to address the ongoing global health crisis.
ARIMA is a statistical model for forecasting univariate time series with autocorrelation and non-stationarity, integrating autoregressive (AR), differencing (I), and moving average (MA) components [13,19]. The AR component links current observations to past values, while the MA component models dependencies on past forecast errors. Differencing achieves stationarity. The ARIMA($p,d,q$) model combines these components, where $p$ is the autoregressive order, $d$ the degree of differencing, and $q$ the moving average order.
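For reference, the ARIMA($p,d,q$) model can be written in its standard textbook form (a general formulation, not an equation reproduced from the source):

```latex
% ARIMA(p, d, q): the d-times differenced series follows an ARMA(p, q) process
\Delta^{d} y_{t} = c
  + \sum_{i=1}^{p} \phi_{i}\,\Delta^{d} y_{t-i}   % AR terms on past values
  + \sum_{j=1}^{q} \theta_{j}\,\varepsilon_{t-j}  % MA terms on past errors
  + \varepsilon_{t},
\qquad \varepsilon_{t} \sim \mathrm{WN}(0, \sigma^{2})
```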
SVR is an extension of the SVM framework for regression tasks. It aims to find a function $f(\mathbf{x}) = \langle \mathbf{w}, \mathbf{x} \rangle + b$ that deviates from the observed targets by at most $\varepsilon$ while remaining as flat as possible, i.e., keeping $\Vert \mathbf{w} \Vert^{2}$ small.
To allow for infeasible constraints due to noise or outliers, slack variables $\xi_i$ and $\xi_i^{*}$ are introduced, yielding the optimization problem
$$\min_{\mathbf{w}, b, \xi, \xi^{*}} \; \frac{1}{2}\Vert \mathbf{w} \Vert^{2} + C \sum_{i=1}^{n} (\xi_i + \xi_i^{*})$$
subject to $y_i - \langle \mathbf{w}, \mathbf{x}_i \rangle - b \le \varepsilon + \xi_i$, $\langle \mathbf{w}, \mathbf{x}_i \rangle + b - y_i \le \varepsilon + \xi_i^{*}$, and $\xi_i, \xi_i^{*} \ge 0$.
Several commonly used kernels in SVR include the following:
- Linear kernel: $K(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i^{\top} \mathbf{x}_j$
- Polynomial kernel: $K(\mathbf{x}_i, \mathbf{x}_j) = (\gamma\, \mathbf{x}_i^{\top} \mathbf{x}_j + r)^{d}$
- Radial basis function (RBF) kernel: $K(\mathbf{x}_i, \mathbf{x}_j) = \exp(-\gamma \Vert \mathbf{x}_i - \mathbf{x}_j \Vert^{2})$
Regularization through the parameter $C$ controls the trade-off between the flatness of $f$ and the degree to which deviations larger than $\varepsilon$ are tolerated: larger values of $C$ penalize errors more heavily, while smaller values produce smoother functions.
RF is an ensemble learning algorithm that builds multiple decision trees and aggregates their outputs to perform regression. It was introduced by Breiman [5] and is particularly effective in handling high-dimensional data, nonlinear relationships, and reducing overfitting.
In regression settings, the goal is to predict a continuous target variable $y$ from an input vector $\mathbf{x}$. An RF with $B$ trees produces the ensemble prediction $\hat{y} = \frac{1}{B} \sum_{b=1}^{B} T_b(\mathbf{x})$, the average of the individual tree outputs $T_b(\mathbf{x})$.
Each tree is trained on a bootstrap sample, which is created by randomly sampling (with replacement) from the original training dataset. This introduces variability among the trees. Additionally, during the tree-building process, RF introduces further randomness by selecting a random subset of features at each split rather than considering all features. This decorrelates the trees and enhances generalization.
At each node of a decision tree, the algorithm selects the best split point among the randomly chosen features based on a splitting criterion that minimizes prediction error. For regression, the most commonly used criterion is the mean-squared error between predicted and actual values in the resulting child nodes. This local optimization helps guide the recursive partitioning of the input space.
Unlike a single decision tree, which may overfit the training data, RF reduces the variance of predictions by averaging across many trees. The ensemble effect ensures that individual overfitting errors are averaged out, leading to more robust and accurate predictions.
RNNs are a class of neural architectures tailored for sequential data modeling. They maintain a hidden state $\mathbf{h}_t = \phi(\mathbf{W}_h \mathbf{h}_{t-1} + \mathbf{W}_x \mathbf{x}_t + \mathbf{b})$ that is updated at each time step, allowing information from earlier inputs to influence later predictions.
LSTM is a specialized type of RNN designed to capture long-range dependencies in sequential data by using gates to control the flow of information. Each LSTM cell contains three primary gates: the forget gate, input gate, and output gate, which regulate how information is retained, updated, and outputted at each time step [6].
The forget gate $\mathbf{f}_t = \sigma(\mathbf{W}_f [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_f)$ determines which information from the previous cell state is discarded.
The input gate $\mathbf{i}_t = \sigma(\mathbf{W}_i [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_i)$ controls how much new information enters the cell.
The cell candidate $\tilde{\mathbf{c}}_t = \tanh(\mathbf{W}_c [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_c)$ proposes new content, and the cell state is updated as $\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t$.
The output gate $\mathbf{o}_t = \sigma(\mathbf{W}_o [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_o)$ regulates what is exposed from the cell state.
Finally, the hidden state is computed as $\mathbf{h}_t = \mathbf{o}_t \odot \tanh(\mathbf{c}_t)$.
Figure 1 presents the LSTM network, which incorporates memory cells to handle long-term dependencies effectively.

Architecture of the LSTM network. Source: From ref. [39].
The GRU is a simpler version of LSTM with fewer parameters. It combines the forget and input gates of LSTM into a single update gate. The GRU has two main gates: the update gate and the reset gate [39].
The update gate $\mathbf{z}_t = \sigma(\mathbf{W}_z [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_z)$ determines how much of the previous hidden state is carried forward.
The reset gate $\mathbf{r}_t = \sigma(\mathbf{W}_r [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_r)$ controls how much past information is used when forming the new candidate.
The candidate hidden state is $\tilde{\mathbf{h}}_t = \tanh(\mathbf{W}_h [\mathbf{r}_t \odot \mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_h)$.
Finally, the hidden state is updated as $\mathbf{h}_t = (1 - \mathbf{z}_t) \odot \mathbf{h}_{t-1} + \mathbf{z}_t \odot \tilde{\mathbf{h}}_t$.
The structure of the GRU is illustrated in Figure 2, where gating mechanisms help regulate information flow and improve efficiency.

Architecture of the GRU. Source: From ref. [39].
The predictive performance of the models is evaluated using four standard metrics: RMSE, MAE, MAPE, and $R^2$, defined as
$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{t=1}^{n} (y_t - \hat{y}_t)^{2}}, \qquad \mathrm{MAE} = \frac{1}{n} \sum_{t=1}^{n} |y_t - \hat{y}_t|,$$
$$\mathrm{MAPE} = \frac{100}{n} \sum_{t=1}^{n} \left| \frac{y_t - \hat{y}_t}{y_t} \right|, \qquad R^2 = 1 - \frac{\sum_{t} (y_t - \hat{y}_t)^{2}}{\sum_{t} (y_t - \bar{y})^{2}},$$
where $y_t$ and $\hat{y}_t$ denote the observed and predicted values and $\bar{y}$ is the sample mean. Lower RMSE, MAE, and MAPE and higher $R^2$ indicate better predictive performance.
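These four metrics can be computed directly with scikit-learn and NumPy. A minimal sketch (function and variable names are illustrative, and the MAPE line assumes no zero actual values):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

def evaluate_forecast(y_true, y_pred):
    """Return the four evaluation metrics used in this study."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    mae = mean_absolute_error(y_true, y_pred)
    mape = 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))  # assumes y_true != 0
    r2 = r2_score(y_true, y_pred)
    return {"RMSE": rmse, "MAE": mae, "MAPE": mape, "R2": r2}
```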
To statistically evaluate whether the forecasting performance of two competing models differs significantly, the DM test is applied [8]. The test is based on the loss differential series $d_t = L(e_{1t}) - L(e_{2t})$, where $e_{1t}$ and $e_{2t}$ are the forecast errors of the two models and $L(\cdot)$ is a loss function.
The DM test evaluates the hypotheses $H_0: \mathbb{E}[d_t] = 0$ (equal predictive accuracy) against $H_1: \mathbb{E}[d_t] \neq 0$.
The test statistic is given by
$$\mathrm{DM} = \frac{\bar{d}}{\sqrt{\hat{\sigma}_{\bar{d}}^{2} / n}},$$
where $\bar{d}$ is the sample mean of the loss differentials, $\hat{\sigma}_{\bar{d}}^{2}$ is a consistent estimate of the long-run variance of $d_t$, and $n$ is the number of forecasts.
Under the null hypothesis, the DM statistic asymptotically follows a standard normal distribution: $\mathrm{DM} \xrightarrow{d} N(0, 1)$.
If the absolute value of the DM statistic exceeds the critical value from the standard normal distribution, the null hypothesis is rejected, indicating a statistically significant difference in predictive performance between the two models. In this study, the DM test is applied using MSE, MAE, and MAPE as loss functions to ensure comprehensive evaluation.
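A minimal sketch of the DM test for one-step forecasts under the three loss functions used in this study. It is a simplification that uses only the lag-0 variance of the loss differential, which suffices for one-step-ahead comparisons; the exact implementation used by the authors is not given in the source:

```python
import numpy as np
from scipy import stats

def diebold_mariano(y_true, pred1, pred2, loss="mse"):
    """DM test for equal one-step predictive accuracy of two forecasts."""
    y_true = np.asarray(y_true, float)
    e1 = y_true - np.asarray(pred1, float)
    e2 = y_true - np.asarray(pred2, float)
    if loss == "mse":
        d = e1**2 - e2**2
    elif loss == "mae":
        d = np.abs(e1) - np.abs(e2)
    elif loss == "mape":
        d = np.abs(e1 / y_true) - np.abs(e2 / y_true)  # assumes y_true != 0
    else:
        raise ValueError("unsupported loss")
    n = d.size
    dm = d.mean() / np.sqrt(d.var(ddof=1) / n)  # lag-0 long-run variance only
    p_value = 2 * (1 - stats.norm.cdf(abs(dm)))
    return dm, p_value
```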
The experimental setup follows an 80:20 data split, where 80% of the time series is used for training and 20% for testing. We evaluate the forecasting performance of six models: ARIMA, SVR, RF, RNN, GRU, and LSTM. These models are applied to four key variables, with all learning carried out under a consistent framework.
To prepare the time series data for supervised learning with SVR, RF, and the recurrent networks, we apply a sliding window approach with a 30-day window. Given a univariate sequence $\{y_1, y_2, \dots, y_T\}$, each input vector consists of the 30 preceding observations, $X_t = (y_{t-30}, \dots, y_{t-1})$, with the corresponding target $y_t$; a sketch of this construction is given below.
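A minimal sketch of the windowing step with NumPy; the 30-day window length follows the text, while the function name is illustrative:

```python
import numpy as np

def make_windows(series, window=30):
    """Turn a 1-D series into (X, y) pairs for supervised learning.

    Each row of X holds `window` consecutive observations, and the
    corresponding entry of y is the value immediately following them.
    """
    series = np.asarray(series, dtype=float)
    X = np.stack([series[i : i + window] for i in range(len(series) - window)])
    y = series[window:]
    return X, y

# Example: X, y = make_windows(total_cases, window=30)
```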
For the DL models (RNN, GRU, LSTM), feature scaling is performed using Min-Max normalization to stabilize and accelerate the training process. The normalization formula is
$$y^{\prime} = \frac{y - y_{\min}}{y_{\max} - y_{\min}},$$
which maps each value into the range $[0, 1]$.
After prediction, the inverse transformation is applied to recover the original scale:
$$y = y^{\prime} (y_{\max} - y_{\min}) + y_{\min}.$$
Rescaling ensures that input features are within a uniform range, preventing dominance by features with larger numerical values. This enhances the convergence behavior of DL algorithms and contributes to better generalization performance.
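In practice this can be done with scikit-learn's MinMaxScaler. The sketch below fits the scaler on the training portion only, a standard precaution against test-set leakage that the text does not spell out; the data values are illustrative:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train = np.array([146.05, 3737.57, 8218.49])   # illustrative values
test = np.array([12979.51, 21393.05])

scaler = MinMaxScaler(feature_range=(0, 1))
train_scaled = scaler.fit_transform(train.reshape(-1, 1))  # fit on train only
test_scaled = scaler.transform(test.reshape(-1, 1))        # reuse train min/max

# After predicting on the scaled data, invert the transform:
preds_original = scaler.inverse_transform(test_scaled)     # recovers `test`
```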
Hyperparameter tuning was carried out for RF, ARIMA, and SVR using grid search or automated routines, while DL models (RNN, LSTM, GRU) employed fixed configurations.
For ARIMA, the model parameters were automatically selected using the auto_arima routine from the pmdarima package in Python, resulting in an optimal configuration of ARIMA(4,1,0) for all datasets. The differencing order $d = 1$ is consistent with the ADF test results, which indicate that the series are non-stationary in levels.
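A minimal sketch of the order selection with pmdarima's auto_arima; the series here is an illustrative stand-in for a training split, and the keyword arguments shown are common defaults rather than settings reported in the source:

```python
import numpy as np
import pmdarima as pm

# Illustrative stand-in for the training portion of one series (e.g., total cases).
rng = np.random.default_rng(0)
train = np.cumsum(rng.normal(50.0, 10.0, size=399))

# Stepwise search over (p, d, q) orders, minimizing an information criterion;
# the study reports ARIMA(4, 1, 0) as the selected configuration.
model = pm.auto_arima(train, seasonal=False, stepwise=True, suppress_warnings=True)
print(model.order)                       # e.g., (4, 1, 0)
forecast = model.predict(n_periods=100)  # one value per test-set day
```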
For SVR, hyperparameter tuning was performed using grid search, which explored the following parameter space (see the sketch after this list):
- Regularization parameter: $C \in \{1, 10, 100\}$
- Kernel coefficient: $\gamma \in \{10^{-6}, 10^{-5}, 10^{-4}\}$
- Kernel: RBF, linear, and polynomial.
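A minimal sketch of this search with scikit-learn's GridSearchCV. The time-ordered cross-validation folds and the toy windowed data are assumptions, since the source does not name its CV scheme:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.svm import SVR

# Toy windowed data standing in for make_windows(series, window=30).
rng = np.random.default_rng(0)
series = np.cumsum(rng.normal(50.0, 10.0, size=200))
X_train = np.stack([series[i : i + 30] for i in range(len(series) - 30)])
y_train = series[30:]

param_grid = {
    "C": [1, 10, 100],
    "gamma": [1e-6, 1e-5, 1e-4],
    "kernel": ["rbf", "linear", "poly"],
}
search = GridSearchCV(
    SVR(),
    param_grid,
    cv=TimeSeriesSplit(n_splits=5),          # assumption: time-ordered folds
    scoring="neg_root_mean_squared_error",
)
search.fit(X_train, y_train)
print(search.best_params_)
```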
For the RF model, hyperparameter tuning was performed using a grid search over a comprehensive parameter space to optimize predictive performance (see the sketch after this list). The following hyperparameters were explored:
- Number of trees (n_estimators): {50, 100, 200}
- Maximum tree depth (max_depth): {None, 10, 20, 30}
- Minimum samples required to split a node (min_samples_split): {2, 5, 10}
- Minimum samples required at a leaf node (min_samples_leaf): {1, 2, 4}
- Maximum number of features considered for splits (max_features): {auto, sqrt}
- Splitting criterion (criterion): {mse, mae}
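A minimal sketch of the equivalent search with scikit-learn, reusing the windowed X_train and y_train from the SVR sketch above. Note that recent scikit-learn releases spell the criteria "squared_error"/"absolute_error" (formerly "mse"/"mae") and have removed "auto" from max_features, so the grid below uses the modern equivalents:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "max_features": [1.0, "sqrt"],                     # 1.0 replaces the removed "auto"
    "criterion": ["squared_error", "absolute_error"],  # formerly "mse" / "mae"
}
search = GridSearchCV(
    RandomForestRegressor(random_state=42),            # assumption: fixed seed
    param_grid,
    cv=TimeSeriesSplit(n_splits=5),                    # assumption, as for SVR
    scoring="neg_root_mean_squared_error",
    n_jobs=-1,
)
search.fit(X_train, y_train)  # windowed data, as in the SVR sketch
best_rf = search.best_estimator_
```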
For the DL models, we used a fixed architecture with three layers and ReLU activation. The optimizer for all DL models was Adam with a learning rate of 0.001. Dropout was applied with a rate of 0.2 to prevent overfitting, and early stopping was used to halt training if the validation loss did not improve. The models were trained for 100 epochs with a batch size of 16.
The parameter settings for the DL models are summarized in Table 1. These settings ensure consistency across all models, enabling a fair comparison of their performance.
Parameter settings for RNN, LSTM, and GRU models
| Parameter | RNN | LSTM | GRU |
|---|---|---|---|
| Number of layers | 3 | 3 | 3 |
| Activation | ReLU | ReLU | ReLU |
| Loss function | MSE | MSE | MSE |
| Optimizer | Adam | Adam | Adam |
| Learning rate | 0.001 | 0.001 | 0.001 |
| Dropout rate | 0.2 | 0.2 | 0.2 |
| Epochs | 100 | 100 | 100 |
| Batch size | 16 | 16 | 16 |
| Units per layer | 100, 50, 25 | 100, 50, 25 | 100, 50, 25 |
| Early stopping | Yes (monitor = val_loss, patience = 10) | Yes (monitor = val_loss, patience = 10) | Yes (monitor = val_loss, patience = 10) |
Explanation of parameters for RNN, LSTM, and GRU models:
-
Number of layers: All models have three layers of recurrent units (RNN, LSTM, or GRU). More layers allow the models to capture more complex patterns in sequential data.
-
Activation function: The activation function used in all models is ReLU (rectified linear unit), which helps mitigate the vanishing gradient problem and accelerates convergence.
-
Loss function: The MSE is used as the loss function for all models. It measures the average of the squares of the errors, making it sensitive to larger errors and ensuring that the model minimizes them.
-
Optimizer: Adam (Adaptive Moment Estimation) is used for optimization in all models. It adjusts the learning rate during training to improve convergence speed and accuracy.
-
Learning rate: Set to 0.001 for all models; this determines the step size during model training. A small learning rate helps with more stable training.
-
Dropout rate: Dropout of 0.2 is used to prevent overfitting. It randomly disables 20% of neurons in the network during training, forcing the model to learn more robust representations.
-
Epochs: All models are trained for 100 epochs to ensure enough iterations for convergence.
-
Batch size: A batch size of 16 is used, meaning 16 samples are processed at once during each training iteration.
-
Units per layer: The number of units in each layer is set to 100 for the first layer, 50 for the second, and 25 for the third layer, with the goal of progressively reducing the complexity of the model.
-
Early stopping: This mechanism halts training when the validation loss stops improving, preventing overfitting and unnecessary training.
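A minimal Keras sketch of one such network (the LSTM variant; replacing layers.LSTM with layers.SimpleRNN or layers.GRU yields the other two models). Layer sizes, activation, dropout, optimizer, loss, and early stopping follow Table 1, while the final Dense(1) output head and the validation split are assumptions the table does not state:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_lstm(window=30):
    """Stacked LSTM with 100/50/25 units, ReLU, and 0.2 dropout, per Table 1."""
    model = keras.Sequential([
        layers.Input(shape=(window, 1)),
        layers.LSTM(100, activation="relu", return_sequences=True),
        layers.Dropout(0.2),
        layers.LSTM(50, activation="relu", return_sequences=True),
        layers.Dropout(0.2),
        layers.LSTM(25, activation="relu"),
        layers.Dropout(0.2),
        layers.Dense(1),  # assumption: single-output head producing the forecast
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001), loss="mse")
    return model

early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=10)
# model = build_lstm()
# model.fit(X_train, y_train, validation_split=0.1,  # assumption: validation share
#           epochs=100, batch_size=16, callbacks=[early_stop])
```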
The Python packages used in this study include NumPy for numerical operations and array handling, pandas for data manipulation and analysis, Seaborn for data visualization, Matplotlib for creating static, animated, and interactive visualizations, scikit-learn for ML models including SVR, RF, and performance metrics, statsmodels for statistical models and time series analysis, pmdarima for automatic ARIMA model selection, and Keras for building and training DL models.
Figure 3 presents the time series plots of four key variables: total cases, severe cases, critical cases, and total deaths. The patterns indicate an increasing trend over time with noticeable fluctuations. The presence of volatility suggests the need for further statistical analysis, such as stationarity tests and autocorrelation diagnostics.

Time series plots of total cases, severe cases, critical cases, and total deaths. Source: Created by the authors.
Table 2 presents the summary statistics of the four key variables. The means and standard deviations highlight high variability in the dataset. The skewness values indicate a slight positive skew, suggesting that the distributions are slightly right-tailed. The negative kurtosis values suggest that the distributions are flatter than a normal distribution, indicating fewer extreme events.

Summary statistics

| | Count | Mean | Std Dev | Min | 25% | 50% | 75% | Max | Kurtosis | Skewness |
|---|---|---|---|---|---|---|---|---|---|---|
| Total cases | 499 | 8841.58 | 5911.12 | 146.05 | 3737.57 | 8218.49 | 12979.51 | 21393.05 | – | 0.4086 |
| Severe cases | 499 | 4210.27 | 2814.82 | 69.55 | 1779.80 | 3913.57 | 6180.72 | 10187.17 | – | 0.4086 |
| Critical cases | 499 | 1472.06 | 983.35 | 25.06 | 616.25 | 1359.01 | 2158.80 | 3570.59 | – | 0.4116 |
| Total deaths | 499 | 588.83 | 393.34 | 10.03 | 246.50 | 543.60 | 863.52 | 1428.24 | – | 0.4116 |
Table 3 shows the results of the augmented Dickey-Fuller (ADF) test for stationarity. Since the p-values are greater than 0.05, we fail to reject the null hypothesis of a unit root, indicating that the time series are non-stationary. This suggests the need for differencing or transformation before applying forecasting models.

ADF test results for stationarity

| Variable | ADF statistic | p-value |
|---|---|---|
| Total cases | – | 0.4360 |
| Severe cases | – | 0.4360 |
| Critical cases | – | 0.4327 |
| Total deaths | – | 0.4328 |
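The stationarity check itself is a one-liner with statsmodels. A minimal sketch on an illustrative random-walk series (the Kenyan series would be passed in its place); the null hypothesis is that the series has a unit root, i.e., is non-stationary:

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(0)
series = np.cumsum(rng.normal(50.0, 10.0, size=499))  # illustrative non-stationary series

adf_stat, p_value, *_ = adfuller(series)
print(f"ADF statistic = {adf_stat:.4f}, p-value = {p_value:.4f}")
# p-value > 0.05 -> fail to reject the unit-root null -> difference the series
```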
Figure 4 displays the autocorrelation function (ACF) and partial autocorrelation function (PACF) plots for the four variables. The slow decay in the ACF plots suggests the presence of long memory in the time series. Long memory, also known as long-range dependence, implies that past observations significantly influence future values over a prolonged period.

ACF and PACF for total cases, severe cases, critical cases, and total deaths. Source: Created by the authors.
Table 4 summarizes the forecasting performance of the six models – ARIMA, SVR, RNN, GRU, LSTM, and RF – across the four datasets: total cases, critical cases, severe cases, and total deaths, with metrics reported to four decimal places. RF consistently outperforms all models across all datasets, achieving the lowest RMSE, MAE, and MAPE and the highest $R^2$ values, nearing 1.0000. For total cases, RF yields an RMSE of 93.4117, MAE of 35.9370, MAPE of 0.2668, and $R^2$ of 0.9995; for critical cases, an RMSE of 17.5342, MAE of 7.3318, MAPE of 0.3330, and $R^2$ of 0.9994; for severe cases, an RMSE of 44.4818, MAE of 17.1128, MAPE of 0.2668, and $R^2$ of 0.9995; and for total deaths, an RMSE of 7.0137, MAE of 2.9327, MAPE of 0.3330, and $R^2$ of 0.9994. These results highlight RF's superior accuracy and robustness in capturing nonlinear patterns in COVID-19 data.
Performance metrics (RMSE, MAE, MAPE, and $R^2$) for the six models across the four datasets

| Dataset | Model | RMSE | MAE | MAPE (%) | $R^2$ |
|---|---|---|---|---|---|
| Total cases | ARIMA | 13870.6770 | 10916.6357 | 82.0056 | – |
| | SVR | 5589.6811 | 3874.5758 | 26.4893 | – |
| | RNN | 365.3872 | 289.7108 | 2.3229 | 0.9929 |
| | LSTM | 338.4482 | 281.7235 | 2.4740 | 0.9939 |
| | GRU | 215.3505 | 177.0998 | 1.5449 | 0.9975 |
| | **RF** | **93.4117** | **35.9370** | **0.2668** | **0.9995** |
| Critical cases | ARIMA | 2055.3009 | 1599.7675 | 70.9180 | – |
| | SVR | 467.2761 | 216.0696 | 6.8270 | 0.5970 |
| | RNN | 62.5152 | 54.0835 | 2.8693 | 0.9928 |
| | LSTM | 185.1695 | 144.1515 | 6.6211 | 0.9367 |
| | GRU | 42.5913 | 35.1251 | 1.8999 | 0.9967 |
| | **RF** | **17.5342** | **7.3318** | **0.3330** | **0.9994** |
| Severe cases | ARIMA | 6593.2169 | 5188.9595 | 81.8556 | – |
| | SVR | 2223.5193 | 1341.9956 | 17.5188 | – |
| | RNN | 140.1730 | 110.0014 | 2.3796 | 0.9954 |
| | LSTM | 281.8612 | 221.9638 | 3.9355 | 0.9814 |
| | GRU | 140.4860 | 113.2804 | 1.9507 | 0.9954 |
| | **RF** | **44.4818** | **17.1128** | **0.2668** | **0.9995** |
| Total deaths | ARIMA | 822.0758 | 639.8712 | 70.9140 | – |
| | SVR | 89.1646 | 34.3085 | 2.6381 | 0.9083 |
| | RNN | 23.3445 | 17.9225 | 2.0940 | 0.9937 |
| | LSTM | 40.4583 | 31.6118 | 4.0341 | 0.9811 |
| | GRU | 21.3337 | 18.6039 | 2.5657 | 0.9947 |
| | **RF** | **7.0137** | **2.9327** | **0.3330** | **0.9994** |

All values are reported to four decimal places. Bold rows indicate the RF model, which consistently outperforms the other models.
ARIMA shows the weakest performance, with high error metrics and negative $R^2$ values, indicating a fit worse than a constant-mean baseline. For total cases, ARIMA's RMSE is 13870.6770, MAE is 10916.6357, MAPE is 82.0056, and $R^2$ is negative.
RF's consistent performance across datasets underscores its suitability for epidemic forecasting in this study, where accuracy and efficiency are vital. Its low errors and near-perfect $R^2$ values indicate strong generalization, ideal for public health applications. The limitations of ARIMA and SVR on complex, nonlinear data, and the moderate performance of RNN, GRU, and LSTM, suggest the recurrent models are better suited to long-term temporal patterns. These findings support the use of ensemble methods like RF, with potential for future hybrid models combining RF and DL to improve accuracy.
Figure 5 illustrates the comparison between predicted and actual values for all four datasets. The close alignment of RF's predictions with the actual values, with minimal deviations compared to the other models, further demonstrates its consistent predictive accuracy and strong generalization in resource-constrained settings like Kenya.

Prediction vs actual values for total cases, critical cases, severe cases, and total deaths datasets. Source: Created by the authors.
Figure 6 presents a bar plot comparing the forecasting performance metrics of six models.

Bar plot of performance metrics (RMSE, MAE, MAPE, and
Figures 7, 8, 9, and 10 confirm that RF predictions exhibit a strong linear correlation with actual values across all datasets. The alignment of the majority of points along the diagonal in each scatterplot suggests that RF consistently provides accurate predictions with minimal deviations.

Scatterplot of observed vs predicted values for the total cases dataset. The RF model demonstrates a strong linear relationship, indicating accurate predictions. Source: Created by the authors.

Scatterplot of observed vs predicted values for the critical cases dataset. The RF model demonstrates a strong linear relationship, indicating accurate predictions. Source: Created by the authors.

Scatterplot of observed vs predicted values for the severe cases dataset. The RF model demonstrates a strong linear relationship, indicating accurate predictions. Source: Created by the authors.

Scatterplot of observed vs predicted values for the total deaths dataset. The RF model demonstrates a strong linear relationship, indicating accurate predictions. Source: Created by the authors.
Figure 11 illustrates the distribution of forecasting errors for the total cases, critical cases, severe cases, and total deaths datasets. The plot compares the error distributions across the six models, demonstrating that RF consistently exhibits the lowest median error, minimal variance, and fewer outliers across all datasets. In contrast, models such as ARIMA and SVR show higher error variance and more outliers, while RNN, GRU, and LSTM display moderate performance, suggesting they are better suited to long-term temporal patterns. These findings highlight RF's superior balance of accuracy and error minimization.

Boxplot of forecasting error distributions for six models – ARIMA, SVR, RNN, LSTM, GRU, and RF – across four COVID-19 datasets. Each model’s errors are grouped by dataset, with boxes showing the interquartile range, median, and outliers of the forecasting errors. The RF model consistently exhibits the lowest median error, minimal variance, and fewer outliers across all datasets, highlighting its superior predictive accuracy and stability. Source: Created by the authors.
A heatmap, as shown in Figure 12, visually represents model performance through color gradients, clearly highlighting differences in RMSE, MAE, and MAPE across the total cases, critical cases, severe cases, and total deaths datasets. Relative to the RF baseline, every competing model shows higher errors, with ARIMA exhibiting the largest gap.

Heatmap of model performance metrics (RMSE, MAE, MAPE) across total cases, critical cases, severe cases, and total deaths datasets. Relative to the RF baseline, every competing model shows higher errors, with ARIMA exhibiting the largest gap. Source: Created by the authors.
As a robustness check, the DM test, as shown in Table 5, assesses the statistical significance of differences in forecasting errors between the best-performing model, RF, and the benchmark models (ARIMA, SVR, RNN, LSTM, GRU) across the total cases, critical cases, severe cases, and total deaths datasets, using MSE, MAE, and MAPE as loss functions. All comparisons are significant at the 1% level ($p < 0.01$), confirming that RF's forecasting errors differ significantly from those of every benchmark.
DM test statistics comparing RF to benchmark models across datasets and loss functions

| Dataset | Benchmark model | MSE | MAE | MAPE |
|---|---|---|---|---|
| Total cases | ARIMA | – | – | – |
| | SVR | – | – | – |
| | RNN | – | – | – |
| | LSTM | – | – | – |
| | GRU | – | – | – |
| Critical cases | ARIMA | – | – | – |
| | SVR | – | – | – |
| | RNN | – | – | – |
| | LSTM | – | – | – |
| | GRU | – | – | – |
| Severe cases | ARIMA | – | – | – |
| | SVR | – | – | – |
| | RNN | – | – | – |
| | LSTM | – | – | – |
| | GRU | – | – | – |
| Total deaths | ARIMA | – | – | – |
| | SVR | – | – | – |
| | RNN | – | – | – |
| | LSTM | – | – | – |
| | GRU | – | – | – |

Note: All DM statistics are significant at the 1% level ($p < 0.01$).
The primary objective of this study was to evaluate the forecasting performance of statistical, ML, and DL models for predicting COVID-19 trends in Kenya, focusing on total cases, critical cases, severe cases, and total deaths. By comparing models such as ARIMA, SVR, RF, RNN, LSTM, and GRU, the study aimed to identify the most effective approach for epidemic forecasting in a resource-constrained setting. A robust evaluation framework, including multiple error metrics and the DM test, was employed to assess predictive accuracy and statistical significance of differences in forecasting errors. This comprehensive analysis sought to provide actionable insights for public health decision-making by determining which models best capture the complex dynamics of epidemic data in Kenya.
The key finding of this study is that ensemble ML methods, particularly RF, offer superior predictive accuracy and computational efficiency for COVID-19 forecasting in Kenya, making them highly suitable for resource-limited environments. While DL models such as GRU and LSTM show promise in capturing temporal dependencies, their performance is generally outshone by RF, which consistently excels across all datasets. In contrast, traditional statistical models like ARIMA struggle with the nonlinear patterns inherent in epidemic data, highlighting the advantage of ML approaches in such contexts. The DM test reinforces these findings by confirming significant differences in forecasting performance, with RF typically outperforming benchmarks, except in specific cases where other models show marginal advantages.
These findings underscore the potential of ML, particularly RF, to enhance epidemic preparedness, where efficient resource allocation is critical. The study advocates for the adoption of ensemble methods in public health forecasting while suggesting that future research explore hybrid models combining statistical and ML techniques to further improve accuracy. By leveraging such models, policymakers can make informed decisions to mitigate the impact of infectious diseases.
This study, while comprehensive in evaluating six models for COVID-19 forecasting in Kenya, has several limitations. The analysis relied on a single 80:20 train-test split, focusing on one-step-ahead predictions, which may not fully capture the models' performance in multi-step forecasting scenarios. Time-aware cross-validation, such as walk-forward validation, could provide a more robust assessment but was not implemented due to computational constraints. Additionally, the study was limited to six models, excluding simpler linear models and other advanced techniques, such as hybrid approaches or transformer-based architectures, which might offer complementary insights. The absence of external features, such as new COVID-19 variants, mobility patterns, or socioeconomic indicators, limits the models' ability to account for real-world complexities. To address these limitations, future research should incorporate time-aware cross-validation to enhance model robustness and explore multi-step forecasting to better simulate real-world epidemic scenarios. Expanding the model space to include linear models, hybrid statistical-ML approaches, or transformer-based architectures, as suggested in recent studies, could improve predictive power. Incorporating external features, such as policy changes, mobility data, or socioeconomic factors, and real-time data streams, would provide a more holistic view of epidemic dynamics.