CDSim: A Synthetic Climate Data Simulation Framework for Temperature and Rainfall Modelling in R

Isaac Osei; Acheampong Baafi-Adomako; Sivaparvathi Dusari; Anil Carie

doi:10.5334/jors.666

(1) Overview

Introduction

Climate and weather datasets underpin a wide range of scientific and applied research, including environmental modelling, hydrological forecasting, agricultural planning, ecological management, and climate risk assessment [1, 2, 3]. Long-term, continuous records of temperature and precipitation are particularly critical for validating climate models, assessing climate variability and change, and supporting data-driven approaches such as statistical analysis and machine learning-based prediction systems [4, 5]. In educational settings, these datasets also serve as essential resources for teaching climate data processing, modelling workflows, and reproducible research practices.

Despite the growing availability of global and regional climate products from sources such as ERA5, NASA POWER, NOAA, and satellite-based rainfall datasets, practical barriers to their use remain [6, 7]. Many regions, particularly in developing countries, continue to experience incomplete station records, data gaps, or restricted access to high-quality observational data [8, 9]. In addition, existing datasets often require substantial preprocessing, involve complex file formats, or present licensing and data-sharing constraints that limit their reuse in open research and teaching environments. Spatial and temporal resolution mismatches further reduce their suitability for localised studies or classroom applications.

These challenges are especially pronounced for students and early-career researchers, who frequently lack access to ready-to-use climate datasets for learning analytical pipelines, testing statistical or machine learning methods, or developing reproducible workflows. As a result, methodological experimentation and teaching often rely on fragmented or ad hoc data sources.

Synthetic climate datasets offer a pragmatic and increasingly accepted alternative in such contexts [10]. When transparently generated and evaluated against known climatological behaviour, synthetic time series can support experimentation, benchmarking, education, and method development without reproducing proprietary observations or violating data-sharing restrictions [11, 12]. They enable controlled simulations, facilitate reproducibility, and lower barriers to entry for climate data analysis.

In response to these needs, this study introduces CDSim, an R-based simulation framework designed to generate synthetic yet climatologically realistic monthly time series for key climate variables, including temperature and rainfall, across multiple automatically generated or user-defined stations. CDSim provides a flexible, lightweight, and reproducible environment for generating climate data tailored to specific temporal spans and spatial contexts. By supporting experimentation, workflow testing, and educational use, CDSim aims to enhance climate data accessibility and methodological transparency, particularly in data-constrained regions.

Methods and model design

The CDSim framework generates synthetic monthly climate time series by combining deterministic seasonal structure with stochastic variability. For each station s and month t, the mean temperature follows a harmonic seasonal signal,

T_{mean} (t, s) = μ (s) + A (s) sin (\frac{2 π t}{12}) + ϵ_{t}, ϵ_{t} \sim N (0, σ^{2}),

where μ(s) and A(s) are station-specific baseline and amplitude parameters, and σ = 0.5. Minimum and maximum temperatures are generated as correlated series around this mean,

T_{\min} (t, s) = T_{mean} (t, s) - |Z_{t}|, Z_{t} \sim N (6, 1),

T_{\max} (t, s) = T_{mean} (t, s) + |Z_{t}|,

ensuring realistic diurnal separation while preserving seasonal coherence.
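
The temperature equations above can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's internal code: the helper name `sim_temperature` and the example values for the baseline and amplitude are hypothetical, while σ = 0.5 and Z ~ N(6, 1) follow the text. The sketch reads the equations as sharing one offset Z_t between the minimum and maximum series.

```r
# Minimal base-R sketch of the temperature model described above.
# sim_temperature() is a hypothetical helper, not a CDSim function.
sim_temperature <- function(n_months, mu, A, sigma = 0.5, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  t <- seq_len(n_months)
  # Harmonic seasonal signal plus Gaussian noise with standard deviation sigma
  t_mean <- mu + A * sin(2 * pi * t / 12) + rnorm(n_months, mean = 0, sd = sigma)
  # Offset Z_t ~ N(6, 1); abs() keeps the offset non-negative
  z <- abs(rnorm(n_months, mean = 6, sd = 1))
  data.frame(month = t, t_min = t_mean - z, t_mean = t_mean, t_max = t_mean + z)
}

# Example values for mu and A are assumptions chosen for illustration only
temps <- sim_temperature(n_months = 120, mu = 27, A = 2, seed = 42)
head(temps)
```

Because the offset is an absolute value, the minimum never exceeds the mean and the maximum never falls below it, whatever the random draw.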

Monthly rainfall totals are simulated using a gamma distribution,

P (t, s) \sim Γ (k, θ (t, s)),

where the scale parameter θ(t,s) varies seasonally to represent monsoon intensity and station-level variability, while the shape parameter k controls precipitation intermittency and skewness. This formulation follows established stochastic weather generator principles [11, 12].
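
A gamma rainfall generator of this kind might look as follows in base R. The helper `sim_rainfall`, the sinusoidal form of the seasonal scale, and all parameter values are assumptions made for illustration; the paper does not specify how θ(t, s) is constructed.

```r
# Hypothetical helper: monthly rainfall totals drawn from a gamma distribution
# whose scale follows a simple seasonal cycle (assumed form, not CDSim's own).
sim_rainfall <- function(n_months, shape = 2, base_scale = 40, amp = 0.6, seed = NULL) {
  stopifnot(shape > 0, base_scale > 0, amp >= 0, amp < 1)
  if (!is.null(seed)) set.seed(seed)
  t <- seq_len(n_months)
  # Seasonal scale theta(t): strictly positive because amp < 1
  theta <- base_scale * (1 + amp * sin(2 * pi * t / 12))
  # rgamma() is vectorised over scale; the mean of each draw is shape * theta
  rain <- rgamma(n_months, shape = shape, scale = theta)
  data.frame(month = t, theta = theta, rain_mm = rain)
}

rain <- sim_rainfall(n_months = 240, seed = 42)
summary(rain$rain_mm)
```

Gamma draws are non-negative by construction, and a smaller shape value gives a more strongly right-skewed distribution, which is the sense in which k controls skewness.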

All stochastic components are fully reproducible through controlled random seed initialisation. The framework does not replicate observed time series or extreme events but prioritises realistic statistical behaviour and long-term structure for methodological testing, education, and benchmarking.

Implementation and architecture

CDSim is implemented as an R package within the R statistical computing environment [13], providing a modular and reproducible framework for generating synthetic monthly climate time series.

The core simulation function accepts station metadata (latitude, longitude, and station identifier), a user-defined temporal range, and an optional random seed, and returns station-specific series for temperature and rainfall. Seasonal signals and stochastic components are computed internally and combined into structured station-level time series. Outputs are provided in both comma-separated values (CSV) and Network Common Data Form (NetCDF) formats, ensuring compatibility with standard climate analysis, modelling, and data-sharing workflows. Table 1 summarises the key CDSim functions and their primary usage.

Table 1

Summary of key CDSim functions (Source: Authors’ own compilation).

FUNCTION	PURPOSE	EXAMPLE SYNTAX
create_stations()	Create real or synthetic weather stations	create_stations(source="file.csv")
simulate_climate_series()	Generate synthetic data for temperature and rainfall time series	simulate_climate_series(stations, 2000, 2025)
write_station_csv()	Export simulated data to CSV format	write_station_csv(data, file="file1.csv")
write_station_netcdf()	Export simulated data into NetCDF format	write_station_netcdf(data, file="file1.nc")
plot_station_timeseries()	Produce time series visualisations	plot_station_timeseries(data, "Station_1", var="Avg. Tx")

CDSim follows a modular architecture consisting of four primary components: (i) station metadata generation, (ii) climate simulation engine, (iii) visualisation utilities, and (iv) export interfaces. The simulation engine integrates deterministic seasonal signal generation with stochastic sampling modules. Station metadata are passed to the simulation layer, which constructs temperature and rainfall series using vectorised computations. The resulting datasets are stored in structured data frames before being routed to plotting or export functions.
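
The separation between components can be illustrated with a small base-R sketch. The functions `make_stations`, `simulate_series`, and `export_csv` are hypothetical stand-ins written for this example, not the CDSim functions listed in Table 1, and the coordinates and model parameters are arbitrary. The visualisation component is omitted for brevity.

```r
# Hypothetical stand-ins for the components; these are NOT the CDSim functions.

# (i) Station metadata
make_stations <- function(ids, lat, lon) {
  data.frame(station = ids, lat = lat, lon = lon, stringsAsFactors = FALSE)
}

# (ii) Simulation engine: one row per station-month, no file I/O here
simulate_series <- function(stations, start_year, end_year, seed = 1) {
  set.seed(seed)
  grid <- expand.grid(month = 1:12, year = start_year:end_year,
                      station = stations$station, stringsAsFactors = FALSE)
  n <- nrow(grid)
  # Vectorised seasonal signal plus noise (illustrative parameter values)
  grid$t_mean <- 27 + 2 * sin(2 * pi * grid$month / 12) + rnorm(n, sd = 0.5)
  grid$rain_mm <- rgamma(n, shape = 2, scale = 40)
  grid
}

# (iv) Export interface: only writes the data frame it is given
export_csv <- function(data, file) {
  write.csv(data, file, row.names = FALSE)
  invisible(file)
}

stations <- make_stations(c("S1", "S2"), lat = c(5.6, 6.7), lon = c(-0.2, -1.6))
sim <- simulate_series(stations, 2000, 2004)
out <- export_csv(sim, tempfile(fileext = ".csv"))
```

Keeping the simulation step free of file operations means the same data frame can be passed to a plotting function, a CSV writer, or a NetCDF writer without rerunning the simulation.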

The architecture separates data generation from input/output operations, enabling flexible reuse and future extension. Visualisation is implemented independently using the ggplot2 package [14], while NetCDF export relies on the ncdf4 package [15].

Computationally, CDSim is lightweight and memory-efficient for typical research and teaching scenarios. Memory usage scales linearly with the number of stations and simulated years. Empirical testing indicates that simulating 50 stations over 50 years of monthly data remains well within typical desktop memory limits (under approximately 100 MB of RAM). Processing time is minimal due to vectorised operations, with typical simulations completing within seconds for moderate station counts (<100 stations over 50 years). The package is not optimised for high-resolution daily simulations or very large-scale spatial grids, which may require additional memory management strategies.
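
The linear scaling can be checked with a back-of-the-envelope calculation in base R. The column layout below is an assumption for illustration and may differ from the package's actual output schema; the point is only that 50 stations over 50 years amount to 30,000 station-months, which is a small table.

```r
# Rough size estimate for 50 stations x 50 years of monthly data.
n_stations <- 50
n_years <- 50
n_rows <- n_stations * n_years * 12   # 30,000 station-months

# Assumed layout: integer station/year/month plus four numeric variables
sim <- data.frame(
  station = rep(seq_len(n_stations), each = n_years * 12),
  year    = rep(rep(seq_len(n_years), each = 12), times = n_stations),
  month   = rep(1:12, times = n_stations * n_years),
  t_min   = numeric(n_rows),
  t_mean  = numeric(n_rows),
  t_max   = numeric(n_rows),
  rain_mm = numeric(n_rows)
)

# In-memory size of the data frame in megabytes
size_mb <- as.numeric(object.size(sim)) / 1024^2
size_mb
```

Doubling either the number of stations or the number of years doubles the row count, so the stored table grows in direct proportion; intermediate objects created during simulation and plotting add to the working memory beyond this figure.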

Quality control

To ensure reliability and reproducibility, CDSim incorporates multiple levels of quality control aligned with CRAN submission standards and open research software practices.

Unit Testing

Automated unit tests were implemented using the testthat framework. Tests verify input validation (station metadata structure, temporal bounds), deterministic seasonal signal correctness, stochastic reproducibility under fixed random seeds, output schema integrity, and file export functionality for both CSV and NetCDF formats.

Integration Testing

End-to-end workflow tests were conducted to validate full simulation pipelines, including station creation, climate simulation, visualisation, and data export. Multi-station simulations were evaluated to confirm structural consistency and cross-function compatibility.

Reproducibility Checks

All stochastic components are controlled via user-defined random seeds. Identical seeds produce identical outputs across platforms. Cross-platform consistency was verified through successful CRAN checks on Linux, Windows, and macOS environments.
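
The seed behaviour described here can be demonstrated with base R's random number generator. The helper `draw` is hypothetical and simply mimics the two stochastic ingredients named in the paper (Gaussian noise and gamma rainfall); it is not part of CDSim.

```r
# Same seed gives identical draws; a different seed gives different draws.
draw <- function(seed, n = 24) {
  set.seed(seed)
  list(noise = rnorm(n, sd = 0.5),
       rain  = rgamma(n, shape = 2, scale = 40))
}

run_a <- draw(123)
run_b <- draw(123)
run_c <- draw(124)

same_seed_matches <- identical(run_a, run_b)
other_seed_differs <- !identical(run_a, run_c)
c(same_seed_matches = same_seed_matches, other_seed_differs = other_seed_differs)
```

Results of this kind assume the same random number generator settings on each machine; R's defaults can be inspected with `RNGkind()`.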

Statistical Validation

Simulation outputs were evaluated for climatological plausibility by examining seasonal amplitude preservation, mean and variance stability, rainfall distribution skewness, and non-negativity constraints. Simulated temperature series exhibit smooth harmonic structure with controlled Gaussian variability, while rainfall distributions maintain expected gamma-distribution characteristics.
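
Checks of this kind can be expressed in a few lines of base R. The sketch below is an assumption about how such plausibility checks might be written, not the authors' validation code; the sample size, the gamma parameters, and the tolerance on the mean are illustrative choices.

```r
set.seed(1)
shape <- 2
scale <- 40
rain <- rgamma(5000, shape = shape, scale = scale)

# Sample skewness (moment estimator), written out to avoid extra packages
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

checks <- c(
  non_negative = all(rain >= 0),                       # rainfall cannot be negative
  right_skewed = skewness(rain) > 0,                   # gamma draws are positively skewed
  mean_close   = abs(mean(rain) - shape * scale) < 5   # theoretical mean is shape * scale
)
checks
```

For a gamma distribution the theoretical skewness is 2 / sqrt(k), so the sample skewness can also be compared against that value when the shape parameter is known.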

All automated tests passed successfully, and CRAN checks reported no errors, warnings, or notes at the time of publication. Test coverage includes core simulation routines, export functions, and input validation modules.

Validation and evaluation

The CDSim outputs were evaluated to ensure realistic and coherent climate behaviour. Seasonal consistency was assessed by confirming that simulated temperature and rainfall series reproduce expected annual cycles and timing. Key statistical properties, including central tendency, variability, and distributional shape, were examined to verify climatological plausibility, with temperature showing smooth seasonal variability and rainfall exhibiting positively skewed behaviour.

Simulated series were further compared against observational envelopes derived from Ghana Meteorological Agency data and published literature on Ghanaian climate and the West African monsoon. Although the synthetic data do not replicate observed values, they remain within realistic climatological ranges.

Limitations are acknowledged: the dataset does not represent actual historical events or extremes and is not intended for operational or policy use. Instead, the evaluation focuses on statistical realism and long-term structure, supporting methodological testing, education, and reproducible research.

Application example

In this section, we demonstrate how the CDSim package can be used to simulate climate data. Figure 1 provides an overview of the CDSim workflow pipeline, while Figure 2 presents an example implementation carried out in the RStudio environment.

Figure 1: Overview of the CDSim workflow (Source: Authors’ own illustration).

Figure 2: Demonstration of CDSim workflow (Source: Screenshot from RStudio).

Station metadata are first defined, followed by simulation of temperature and rainfall time series using deterministic seasonal signals and stochastic variability. The resulting datasets are exported in CSV and NetCDF formats and can be visualised using built-in plotting utilities, which are implemented using the ggplot2 package [14].

Figures 3, 4, and 5 present the simulated temporal patterns of maximum temperature, minimum temperature, and rainfall for the Accra station, generated by the CDSim package and visualised using the built-in plotting functions as part of the demonstration workflow.

Figure 3: Time series of simulated maximum temperature for the Accra station (Source: Plotted in RStudio using the CDSim framework).

Figure 4: Time series of simulated minimum temperature for the Accra station (Source: Plotted in RStudio using the CDSim framework).

Figure 5: Time series of simulated total rainfall for the Accra station (Source: Plotted in RStudio using the CDSim framework).

Conclusion and future development

CDSim offers a lightweight and accessible framework for generating realistic synthetic climate datasets, addressing common challenges related to data availability, licensing restrictions, and reproducibility in climate research and education. By combining simple seasonal structure with stochastic variability, the framework enables users to prototype analytical workflows, test modelling approaches, and develop teaching materials in a controlled and transparent environment.

Future development of CDSim will focus on extending its realism and flexibility. Planned enhancements include explicit spatial correlation modelling using approaches such as Gaussian processes, improved representation of extremes and rare events, and integration with machine learning-based climate generators, including variational autoencoders and related architectures. These extensions will further broaden the applicability of CDSim for methodological research while maintaining its emphasis on reproducibility and open access.

(2) Availability

Operating system

Platform independent

Programming language

R Version 4.5.1 or higher

Additional system requirements

2 GB RAM, 5 GB HDD or SSD

Dependencies

ggplot2 and ncdf4

List of contributors

Isaac Osei, Acheampong Baafi-Adomako, and Sivaparvathi Dusari

Software location

Archive

Name: Comprehensive R Archive Network (CRAN)
Persistent identifier: 10.32614/CRAN.package.CDSim
Licence: MIT
Publisher: Isaac Osei
Version published: 0.1.1
Date published: 15/12/25

Code repository

Name: GitHub
Identifier: https://github.com/ikemillar/CDSim
Licence: MIT
Date published: 25/11/25

Language

English

(3) Reuse Potential

CDSim is designed for broad reuse across climate science, environmental modelling, data science, and education. Within climate and hydrological research, the package can be used to generate synthetic temperature and rainfall time series for testing statistical methods, validating analysis pipelines, benchmarking machine learning models, and demonstrating reproducible workflows when observational data are unavailable or restricted. The software is particularly suitable for teaching applications, including coursework in climate data analysis, time series modelling, and environmental data processing.

Beyond climate science, CDSim can be reused in machine learning, geospatial analytics, and software engineering research for prototyping data ingestion, visualisation, and modelling pipelines. The modular design allows users to extend the framework by introducing alternative seasonal functions, spatial correlation structures, extreme-event generators, or additional climate variables. Outputs in CSV and NetCDF formats facilitate integration with R, Python, GIS platforms, and cloud-based workflows.

The software is openly distributed under the MIT License via CRAN, permitting modification and redistribution with attribution. Documentation, including package vignettes, reproducible examples, and function reference materials, is distributed with the package via CRAN. User support is offered through the CRAN package page and the associated project repository, where users may report issues or request enhancements. Contributions and extensions are encouraged via direct contact with the corresponding author.

Competing Interests

The authors have no competing interests to declare.

Author Contributions

Isaac Osei: Conceptualisation, Methodology, Software, Visualisation, Writing – original draft.

Acheampong Baafi-Adomako: Validation, Writing – review and editing.

Sivaparvathi Dusari: Data curation, Formal analysis, Software.

Anil Carie: Supervision, Methodology, Writing – review and editing.