Benchmarking Tabular Data Synthesis: Evaluating Tools, Metrics, and Datasets on Prosumer Hardware for End-Users

Open Access | Dec 2025

References

  1. Ang, Y., Huang, Q., Bao, Y., Tung, A.K.H. and Huang, Z. (2023) ‘TSGBench: Time Series Generation Benchmark’, Proc VLDB Endow, 17(3), pp. 305–318. Available at: 10.14778/3632093.3632097
  2. Arjovsky, M., Chintala, S. and Bottou, L. (2017) ‘Wasserstein GAN’. Available at: https://arxiv.org/abs/1701.07875.
  3. Banerjee, S. (2016) ‘Airline Dataset’. Available at: https://www.kaggle.com/datasets/iamsouravbanerjee/airline-dataset.
  4. Becker, B. and Kohavi, R. (1996) ‘Adult’. UCI Machine Learning Repository. Available at: 10.24432/C5XW20
  5. BlastChar (2017) ‘Customer Churn Dataset’. Available at: https://www.kaggle.com/datasets/blastchar/telco-customer-churn.
  6. Borisov, V., Seßler, K., Leemann, T., Pawelczyk, M. and Kasneci, G. (2023) ‘Language models are realistic tabular data generators’, Proceedings of the International Conference on Learning Representations (ICLR), pp. 1–18. Available at: https://openreview.net/forum?id=cEygmQNOeI.
  7. Brandt, J. and Lanzén, E. (n.d.) ‘A Comparative Review of SMOTE and ADASYN in Imbalanced Data Classification’, Probability Theory and Statistics. Available at: https://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-432162.
  8. Bruno, N. and Chaudhuri, S. (2005) ‘Flexible Database Generators’, Proceedings of the International Conference on Very Large Data Bases (VLDB), pp. 1097–1107. Available at: http://www.vldb.org/archives/website/2005/program/paper/wed/p1097-bruno.pdf.
  9. Cantalupo, S. (2021) ‘SMOGN: Synthetic Minority Oversampling TEchnique for Regression with Gaussian Noise’. GitHub repository. Available at: https://github.com/nickkunz/smogn.
  10. Casola, S., Lauriola, I. and Lavelli, A. (2022) ‘Pre-trained transformers: an empirical comparison’, Machine Learning with Applications, 9, p. 100334. Available at: 10.1016/j.mlwa.2022.100334
  11. Chundawat, V.S., Tarun, A.K., Mandal, M., Lahoti, M. and Narang, P. (2024) ‘A universal metric for robust evaluation of synthetic tabular data’, IEEE Transactions on Artificial Intelligence, 5(1), pp. 300–309. Available at: 10.1109/TAI.2022.3229289
  12. City of Los Angeles (2013) ‘City Payroll Dataset’. Available at: https://www.kaggle.com/datasets/cityofLA/city-payroll-data.
  13. Davila, R.M.F., Groen, S., Panse, F. and Wingerath, W. (2025) ‘Navigating tabular data synthesis research understanding user needs and tool capabilities’, SIGMOD Record, 53(4), pp. 18–35. Available at: 10.1145/3712311.3712315
  14. Davila, R.M.F., Turaev, A. and Wingerath, W. (2025) ‘Measuring LLM Sensitivity in Transformer-based Tabular Data Synthesis’. Available at: https://arxiv.org/abs/2509.20768.
  15. Dua, D. and Graff, C. (2017) ‘UCI machine learning repository’. Available at: https://archive.ics.uci.edu/ml.
  16. European Parliament (2023) ‘Boosting data sharing in the EU: what are the benefits?’. Available at: https://www.europarl.europa.eu/news/en/headlines/society/20220331STO26411/boosting-data-sharing-in-the-eu-what-are-the-benefits (Accessed: 2024-10-30).
  17. Ge, C., Mohapatra, S., He, X. and Ilyas, I.F. (2021) ‘Kamino: constraint-aware differentially private data synthesis’, Proceedings of the VLDB Endowment (PVLDB), 14(10), pp. 1886–1899. Available at: 10.14778/3467861.3467876
  18. Goncalves, A., Ray, P., Soper, B., Stevens, J., Coyle, L. and Sales, A.P. (2020) ‘Generation and evaluation of synthetic patient data’, BMC Medical Research Methodology, 20, p. 108. Available at: 10.1186/s12874-020-00977-1
  19. Gray, J., Sundaresan, P., Englert, S., Baclawski, K. and Weinberger, P.J. (1994) ‘Quickly generating billion-record synthetic databases’, Proceedings of the International Conference on Management of Data (SIGMOD), pp. 243–252. Available at: 10.1145/191839.191886
  20. harlfoxem, K.C. (2016) ‘House sales in King County dataset’. Available at: https://www.kaggle.com/datasets/harlfoxem/housesalesprediction.
  21. Hernandez, M., Epelde, G., Alberdi, A., Cilla, R. and Rankin, D. (2022) ‘Synthetic data generation for tabular health records: A systematic review’, Neurocomputing, 493, pp. 28–45. Available at: 10.1016/j.neucom.2022.04.053
  22. Janosi, A., Steinbrunn, W., Pfisterer, M. and Detrano, R. (1989) ‘Heart Disease’, UCI machine learning repository. Available at: 10.24432/C52P4X
  23. Johnson, B. (2013) ‘Wilt’, UCI machine learning repository. Available at: 10.24432/C5KS4M
  24. Kahn, M. (n.d.) ‘Diabetes’, UCI machine learning repository. Available at: 10.24432/C5T59G
  25. Kim, J., Lee, C. and Park, N. (2023) ‘STaSy: Score-based Tabular data Synthesis’, Proceedings of the International Conference on Learning Representations (ICLR), pp. 1–27. Available at: https://openreview.net/pdf/7cc08c44de490f3e79794b5827aa36b84f99c4c3.pdf.
  26. Kotelnikov, A., Baranchuk, D., Rubachev, I. and Babenko, A. (2023) ‘TabDDPM: modelling tabular data with diffusion models’, Proceedings of the International Conference on Machine Learning (ICML), pp. 17564–17579. Available at: https://proceedings.mlr.press/v202/kotelnikov23a.html.
  27. Kumar, H. (2020) ‘Medical insurance price prediction dataset’. Available at: https://www.kaggle.com/datasets/harishkumardatalab/medical-insurance-price-prediction/data.
  28. Lemaître, G., Nogueira, F. and Aridas, C.K. (2017) ‘Imbalanced-learn: A Python toolbox to tackle the curse of imbalanced datasets in machine learning’, Proceedings of the JMLR Workshop of the International Conference on Machine Learning (ICML), 18(17), pp. 1–5. Available at: http://jmlr.org/papers/v18/16-365.
  29. Liu, T., Qian, Z., Berrevoets, J. and van der Schaar, M. (2023) ‘GOGGLE: Generative modelling for tabular data by learning relational structure’, Proceedings of the International Conference on Learning Representations (ICLR), pp. 1–22. Available at: https://openreview.net/pdf?id=fPVRcJqspu.
  30. LLC, G. (2010) ‘Kaggle: data science platform and datasets’. Available at: https://www.kaggle.com.
  31. Mescheder, L., Geiger, A. and Nowozin, S. (2018) ‘Which training methods for GANs do actually converge?’. Available at: https://arxiv.org/abs/1801.04406.
  32. Nash, W., Sellers, T., Talbot, S., Cawthorn, A. and Ford, W. (1994) ‘Abalone’, UCI machine learning repository. Available at: 10.24432/C55C7W
  33. Neufeld, A., Moerkotte, G. and Lockemann, P.C. (1993) ‘Generating consistent test data for a variable set of general consistency constraints’, VLDB Journal, 2(2), pp. 173–213. Available at: http://www.vldb.org/journal/VLDBJ2/P172.pdf.
  34. Nugent, C. (n.d.) ‘California housing prices’. Available at: https://www.kaggle.com/datasets/camnugent/california-housing-prices.
  35. Park, N., Mohammadi, M., Gorde, K., Jajodia, S., Park, H. and Kim, Y. (2018) ‘Data Synthesis based on generative adversarial networks’, Proceedings of the VLDB Endowment (PVLDB), 11(10), pp. 1071–1083. Available at: 10.14778/3231751.3231757
  36. Patki, N., Wedge, R. and Veeramachaneni, K. (2016) ‘The synthetic data vault’, Proceedings of the IEEE International Conference on Data Science and Advanced Analytics (DSAA), pp. 399–410. Available at: 10.1109/DSAA.2016.49
  37. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M. and Duchesnay, E. (2011) ‘Scikit-learn: Machine learning in Python’, Journal of Machine Learning Research, 12, pp. 2825–2830.
  38. Qian, Z., Davis, R. and van der Schaar, M. (2023) ‘Synthcity: A benchmark framework for diverse use cases of tabular synthetic data’, Proceedings of the Annual Conference on Neural Information Processing Systems (NeurIPS).
  39. Quinlan, J.R. (1987) ‘Credit Approval’, UCI machine learning repository. Available at: 10.24432/C5FS30
  40. Roe, B. (2005) ‘MiniBooNE particle identification’, UCI machine learning repository. Available at: 10.24432/C5QC87
  41. Santangelo, A. and Others (2025) ‘SynthRO: A dashboard-based benchmarking framework for health-related synthetic tabular data’, Proceedings of the International Conference on Artificial Intelligence in Medicine (AIME), To appear.
  42. SciPy (2024a) Jensen-Shannon Divergence — SciPy v1.10.1 Manual.
  43. SciPy (2024b) Kolmogorov-Smirnov Test — SciPy v1.10.1 Manual.
  44. SciPy (2024c) Kullback-Leibler Divergence — SciPy v1.10.1 Manual.
  45. SciPy (2024d) Wasserstein Distance — SciPy v1.10.1 Manual.
  46. Solatorio, A.V. and Dupriez, O. (2023) ‘REaLTabFormer: Generating realistic relational and tabular data using transformers’, CoRR, abs/2302.02041, pp. 1–17. Available at: 10.48550/ARXIV.2302.02041.
  47. Suh, N., Lin, X., Hsieh, D., Honarkhah, M. and Cheng, G. (2023) ‘AutoDiff: combining Auto-encoder and Diffusion model for tabular data synthesizing’, CoRR, abs/2310.15479, pp. 1–12. Available at: 10.48550/ARXIV.2310.15479.
  48. Torgo, L. (2014) ‘House Dataset’. Available at: https://www.openml.org/search?type=data&sort=runs&id=574&status=active.
  49. Whiteson, D. (2014) ‘HIGGS’, UCI machine learning repository. Available at: 10.24432/C5V312
  50. Xu, L., Skoularidou, M., Cuesta-Infante, A. and Veeramachaneni, K. (2019) ‘Modeling tabular data using conditional GAN’, Proceedings of the Annual Conference on Neural Information Processing Systems (NeurIPS), pp. 7333–7343. Available at: https://proceedings.neurips.cc/paper/2019/hash/254ed7d2de3b23ab10936522dd547b78-Abstract.html.
  51. Zhang, H., Zhang, J., Shen, Z., Srinivasan, B., Qin, X., Faloutsos, C., Rangwala, H. and Karypis, G. (2024) ‘Mixed-type tabular data synthesis with score-based diffusion in latent space’, Proceedings of the International Conference on Learning Representations (ICLR). Available at: https://openreview.net/forum?id=4Ay23yeuz0.
  52. Zhang, J., Cormode, G., Procopiuc, C.M., Srivastava, D. and Xiao, X. (2017) ‘PrivBayes: Private data release via Bayesian networks’, ACM Transactions on Database Systems, 42(4), pp. 25:1–25:41. Available at: 10.1145/3134428
  53. Zhang, Y., Zaidi, N.A., Zhou, J. and Li, G. (2022) ‘GANBLR++: Incorporating capacity to generate numeric attributes and leveraging unrestricted Bayesian networks’, Proceedings of the SIAM International Conference on Data Mining (SDM), pp. 298–306. Available at: 10.1137/1.9781611977172.34
  54. Zhao, Z., Birke, R. and Chen, L.Y. (2025) ‘TabuLa: Harnessing language models for tabular data synthesis’, Advances in Knowledge Discovery and Data Mining. PAKDD. Berlin, Heidelberg: Springer-Verlag, pp. 247–259. Available at: 10.1007/978-981-96-8186-0_20
  55. Zhao, Z., Kunar, A., Birke, R. and Chen, L.Y. (2021) ‘CTAB-GAN: Effective table data synthesizing’, Proceedings of the Asian Conference on Machine Learning (ACML), pp. 97–112. Available at: https://proceedings.mlr.press/v157/zhao21a.html.
  56. Zhao, Z., Kunar, A., Birke, R., der Scheer, H.V. and Chen, L.Y. (2024) ‘CTAB-GAN+: enhancing tabular data synthesis’, Frontiers in Big Data, 6. Available at: 10.3389/fdata.2023.1296508
Language: English
Submitted on: Jul 7, 2025
Accepted on: Nov 19, 2025
Published on: Dec 9, 2025
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2025 Maria Fernanda Davila Restrepo, Benjamin Wollmer, Fabian Panse, Wolfram Wingerath, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.