References
- 1Ang, Y., Huang, Q., Bao, Y., Tung, A.K.H. and Huang, Z. (2023) ‘TSGBench: Time Series Generation Benchmark’, Proc VLDB Endow, 17(3), pp. 305–318. Available at: 10.14778/3632093.3632097
- 2Arjovsky, M., Chintala, S. and Bottou, L. (2017) ‘Wasserstein GAN’. Available at:
https://arxiv.org/abs/1701.07875 . - 3Banerjee, S. (2016) ‘Airline Dataset’. Available at:
https://www.kaggle.com/datasets/iamsouravbanerjee/airline-dataset . - 4Becker, B. and Kohavi, R. (1996)
‘Adult’ . UCI Machine Learning Repository. Available at: 10.24432/C5XW20 - 5BlastChar (2017) ‘Customer Churn Dataset’. Available at:
https://www.kaggle.com/datasets/blastchar/telco-customer-churn . - 6Borisov, V., Seßler, K., Leemann, T., Pawelczyk, M. and Kasneci, G. (2023) ‘Language models are realistic tabular data generators’, Proceedings of the International Conference on Learning Representations (ICLR), pp. 1–18. Available at:
https://openreview.net/forum?id=cEygmQNOeI . - 7Brandt, J. and Lanzén, E. (n.d.) ‘A Comparative Review of SMOTE and ADASYN in Imbalanced Data Classification’, Probability Theory and Statistics. Available at:
https://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-432162 . - 8Bruno, N. and Chaudhuri, S. (2005) ‘Flexible Database Generators’, Proceedings of the International Conference on Very Large Data Bases (VLDB), pp. 1097–1107. Available at:
http://www.vldb.org/archives/website/2005/program/paper/wed/p1097-bruno.pdf . - 9Cantalupo, S. (2021) ‘SMOGN: Synthetic Minority Oversampling TEchnique for Regression with Gaussian Noise’. Available at:
https://github.com/nickkunz/smogn , gitHub repository. - 10Casola, S., Lauriola, I. and Lavelli, A. (2022) ‘Pre-trained transformers: an empirical comparison’, Machine Learning with Applications, 9, p.
100334 . Available at: 10.1016/j.mlwa.2022.100334 - 11Chundawat, V.S., Tarun, A.K., Mandal, M., Lahoti, M. and Narang, P. (2024) ‘A universal metric for robust evaluation of synthetic tabular data’, IEEE Transactions on Artificial Intelligence, 5(1), pp. 300–309. Available at: 10.1109/TAI.2022.3229289
- 12City of Los Angeles (2013) ‘City Payroll Dataset’. Available at:
https://www.kaggle.com/datasets/cityofLA/city-payroll-data . - 13Davila, R.M.F., Groen, S., Panse, F. and Wingerath, W. (2025) ‘Navigating tabular data synthesis research understanding user needs and tool capabilities’, SIGMOD Record, 53(4), pp. 18–35. Available at: 10.1145/3712311.3712315
- 14Davila, R.M.F., Turaev, A. and Wingerath, W. (2025) ‘Measuring LLM Sensitivity in Transformer-based Tabular Data Synthesis’. Available at:
https://arxiv.org/abs/2509.20768 . - 15Dua, D. and Graff, C. (2017) ‘UCI machine learning repository’. Available at:
https://archive.ics.uci.edu/ml . - 16European Parliament (2023) ‘Boosting data sharing in the EU: what are the benefits?’. Available at:
https://www.europarl.europa.eu/news/en/headlines/society/20220331STO26411/boosting-data-sharing-in-the-eu-what-are-the-benefits (Accessed: 2024-10-30). - 17Ge, C., Mohapatra, S., He, X. and Ilyas, I.F. (2021) ‘Kamino: constraint-aware differentially private data synthesis’, Proceedings of the VLDB Endowment (PVLDB), 14(10), pp. 1886–1899. Available at: 10.14778/3467861.3467876
- 18Goncalves, A., Ray, P., Soper, B., Stevens, J., Coyle, L. and Sales, A.P. (2020) ‘Generation and evaluation of synthetic patient data’, BMC Medical Research Methodology, 20, p.
108 . Available at: 10.1186/s12874-020-00977-1 - 19Gray, J., Sundaresan, P., Englert, S., Baclawski, K. and Weinberger, P.J. (1994) ‘Quickly generating billion-record synthetic databases’, Proceedings of the International Conference on Management of Data (SIGMOD), pp. 243–252. Available at: 10.1145/191839.191886
- 20harlfoxem, K.C. (2016) ‘House sales in King County dataset’. Available at:
https://www.kaggle.com/datasets/harlfoxem/housesalesprediction . - 21Hernandez, M., Epelde, G., Alberdi, A., Cilla, R. and Rankin, D. (2022) ‘Synthetic data generation for tabular health records: A systematic review’, Neurocomputing, 493, pp. 28–45. Available at: 10.1016/j.neucom.2022.04.053
- 22Janosi, A., Steinbrunn, W., Pfisterer, M. and Detrano, R. (1989) ‘Heart Disease’, UCI machine learning repository. Available at: 10.24432/C52P4X
- 23Johnson, B. (2013) ‘Wilt’, UCI machine learning repository. Available at: 10.24432/C5KS4M
- 24Kahn, M. (n.d.) ‘Diabetes’, UCI machine learning repository. Available at: 10.24432/C5T59G
- 25Kim, J., Lee, C. and Park, N. (2023) ‘STaSy: Score-based Tabular data Synthesis’, Proceedings of the International Conference on Learning Representations (ICLR), pp. 1–27. Available at:
https://openreview.net/pdf/7cc08c44de490f3e79794b5827aa36b84f99c4c3.pdf . - 26Kotelnikov, A., Baranchuk, D., Rubachev, I. and Babenko, A. (2023) ‘TabDDPM: modelling tabular data with diffusion models’, Proceedings of the International Conference on Machine Learning (ICML), pp. 17564–17579. Available at:
https://proceedings.mlr.press/v202/kotelnikov23a.html . - 27Kumar, H. (2020) ‘Medical insurance price prediction dataset’. Available at:
https://www.kaggle.com/datasets/harishkumardatalab/medical-insurance-price-prediction/data . - 28Lemaître, G., Nogueira, F. and Aridas, C.K. (2017) ‘Imbalanced-learn: A Python toolbox to tackle the curse of imbalanced datasets in machine learning’, Proceedings of the JMLR Workshop of the International Conference on Machine Learning (ICML), 18(17), pp. 1–5. Available at:
http://jmlr.org/papers/v18/16-365 . - 29Liu, T., Qian, Z., Berrevoets, J. and van der Schaar, M. (2023) ‘GOGGLE: Generative modelling for tabular data by learning relational structure’, Proceedings of the International Conference on Learning Representations (ICLR), pp. 1–22. Available at:
https://openreview.net/pdf?id=fPVRcJqspu . - 30LLC, G. (2010) ‘Kaggle: data science platform and datasets’. Available at:
https://www.kaggle.com . - 31Mescheder, L., Geiger, A. and Nowozin, S. (2018) ‘Which training methods for GANs do actually converge?’. Available at:
https://arxiv.org/abs/1801.04406 . - 32Nash, W., Sellers, T., Talbot, S., Cawthorn, A. and Ford, W. (1994) ‘Abalone’, UCI machine learning repository. Available at: 10.24432/C55C7W
- 33Neufeld, A., Moerkotte, G. and Lockemann, P.C. (1993) ‘Generating consistent test data for a variable set of general consistency constraints’, VLDB Journal, 2(2), pp. 173–213. Available at:
http://www.vldb.org/journal/VLDBJ2/P172.pdf . - 34Nugent, C. (n.d.) ‘California housing prices’. Available at:
https://www.kaggle.com/datasets/camnugent/california-housing-prices . - 35Park, N., Mohammadi, M., Gorde, K., Jajodia, S., Park, H. and Kim, Y. (2018) ‘Data Synthesis based on generative adversarial networks’, Proceedings of the VLDB Endowment (PVLDB), 11(10), pp. 1071–1083. Available at: 10.14778/3231751.3231757
- 36Patki, N., Wedge, R. and Veeramachaneni, K. (2016) ‘The synthetic data vault’, Proceedings of the IEEE International Conference on Data Science and Advanced Analytics (DSAA), pp. 399–410. Available at: 10.1109/DSAA.2016.49
- 37Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M. and Duchesnay, E. (2011) ‘Scikit-learn: Machine learning in Python’, Journal of Machine Learning Research, 12, pp. 2825–2830
- 38Qian, Z., Davis, R. and van der Schaar, M. (2023) ‘Synthcity: A benchmark framework for diverse use cases of tabular synthetic data’, Proceedings of the Annual Conference on Neural Information Processing Systems (NeurIPS).
- 39Quinlan, J.R. (1987) ‘Credit Approval’, UCI machine learning repository. Available at: 10.24432/C5FS30
- 40Roe, B. (2005) ‘MiniBooNE particle identification’, UCI machine learning repository. Available at: 10.24432/C5QC87
- 41Santangelo, A. and Others (2025) ‘SynthRO: A dashboard-based benchmarking framework for health-related synthetic tabular data’, Proceedings of the International Conference on Artificial Intelligence in Medicine (AIME), To appear.
- 42SciPy (2024a) Jensen-Shannon Divergence — SciPy v1.10.1 Manual.
- 43SciPy (2024b) Kolmogorov-Smirnov Test — SciPy v1.10.1 Manual.
- 44SciPy (2024c) Kullback-Leibler Divergence — SciPy v1.10.1 Manual.
- 45SciPy (2024d) Wasserstein Distance — SciPy v1.10.1 Manual.
- 46Solatorio, A.V. and Dupriez, O. (2023) ‘REaLTabFormer: Generating realistic relational and tabular data using transformers’, CoRR, abs/2302.02041, pp. 1–17. Available at: 10.48550/ARXIV.2302.02041.
- 47Suh, N., Lin, X., Hsieh, D., Honarkhah, M. and Cheng, G. (2023) ‘AutoDiff: combining Auto-encoder and Diffusion model for tabular data synthesizing’, CoRR, abs/2310.15479, pp. 1–12. Available at: 10.48550/ARXIV.2310.15479.
- 48Torgo, L. (2014) ‘House Dataset’. Available at:
https://www.openml.org/search?type=data&sort=runs&id=574&status=active . - 49Whiteson, D. (2014) ‘HIGGS’, UCI machine learning repository. Available at: 10.24432/C5V312
- 50Xu, L., Skoularidou, M., Cuesta-Infante, A. and Veeramachaneni, K. (2019) ‘Modeling tabular data using conditional GAN’, Proceedings of the Annual Conference on Neural Information Processing Systems (NeurIPS), pp. 7333–7343. Available at:
https://proceedings.neurips.cc/paper/2019/hash/254ed7d2de3b23ab10936522dd547b78-Abstract.html . - 51Zhang, H., Zhang, J., Shen, Z., Srinivasan, B., Qin, X., Faloutsos, C., Rangwala, H. and Karypis, G. (2024) ‘Mixed-type tabular data synthesis with score-based diffusion in latent space’, Proceedings of the International Conference on Learning Representations (ICLR). Available at:
https://openreview.net/forum?id=4Ay23yeuz0 . - 52Zhang, J., Cormode, G., Procopiuc, C.M., Srivastava, D. and Xiao, X. (2017) ‘PrivBayes: Private data release via Bayesian networks’, ACM Transactions on Database Systems, 42(4), pp. 25:1–25:41. Available at: 10.1145/3134428
- 53Zhang, Y., Zaidi, N.A., Zhou, J. and Li, G. (2022) ‘GANBLR++: Incorporating capacity to generate numeric attributes and leveraging unrestricted Bayesian networks’, Proceedings of the SIAM International Conference on Data Mining (SDM), pp. 298–306. Available at: 10.1137/1.9781611977172.34
- 55Zhao, Z., Birke, R. and Chen, L.Y. (2025)
‘TabuLa: Harnessing language models for tabular data synthesis’ , Advances in Knowledge Discovery and Data Mining. PAKDD. Berlin, Heidelberg: Springer-Verlag, pp. 247–259. Available at: 10.1007/978-981-96-8186-0_20 - 54Zhao, Z., Kunar, A., Birke, R. and Chen, L.Y. (2021) ‘CTAB-GAN: Effective table data synthesizing’, Proceedings of the Asian Conference on Machine Learning (ACML), pp. 97–112. Available at:
https://proceedings.mlr.press/v157/zhao21a.html . - 56Zhao, Z., Kunar, A., Birke, R., der Scheer, H.V. and Chen, L.Y. (2024) ‘CTAB-GAN+: enhancing tabular data synthesis’, Frontiers in Big Data, 6. Available at: 10.3389/fdata.2023.1296508
