Abstract
Synthetic data is a useful solution when data is scarce or private, as it supports reproducible experimentation, privacy-preserving data sharing, data re-purposing, and robust evaluation of data systems. This study presents a benchmark for tabular data synthesis (TDS) tools, evaluating their performance across six critical dimensions: handling of dataset imbalance, dataset augmentation, handling of missing values, privacy, machine learning (ML) utility, and computational performance. We assessed 13 tools across 15 datasets drawn from different use cases, focusing on prosumer hardware configurations for end-users, and highlighted the trade-offs among the various TDS models; our findings provide practical insights to guide tool selection based on specific use cases and constraints. Sampling-based tools such as SMOTE excelled at handling imbalance and were highly efficient, but offered little privacy or variability. Hybrid and Transformer models demonstrated strong results across most dimensions but required substantial computational resources. Diffusion models achieved high scores but were complex to configure, while Bayesian Networks offered efficiency and privacy with limited utility. The study also emphasizes non-functional considerations such as runtime, resource efficiency, and configuration challenges. The source code and data are available in the GitHub repository.
