Abstract
Synthetic data is a useful solution when data is scarce or private, as it supports reproducible experimentation, privacy-preserving data sharing, data re-purposing, and robust evaluation of data systems. This study presents a benchmark for tabular data synthesis (TDS) tools, evaluating their performance across six critical dimensions: handling of dataset imbalance, dataset augmentation, handling of missing values, privacy, machine learning (ML) utility, and computational performance. We assessed 13 tools across 15 datasets drawn from different use cases, focusing on prosumer hardware configurations for end-users, and highlighted the trade-offs among the various TDS models; our findings provide practical insights to guide tool selection based on specific use cases and constraints. Sampling-based tools such as SMOTE excelled at handling imbalance and were highly efficient, but offered little privacy or variability. Hybrid and Transformer models demonstrated strong results across most dimensions but required substantial computational resources. Diffusion models achieved high scores but were complex to configure, while Bayesian Networks offered efficiency and privacy with limited utility. The study also emphasizes non-functional considerations such as runtime, resource efficiency, and configuration challenges. The source code and data are available in the GitHub repository.
