
Benchmarking Tabular Data Synthesis: Evaluating Tools, Metrics, and Datasets on Prosumer Hardware for End-Users

Open Access | Dec 2025

Figures & Tables

Figure 1

Taxonomy of TDS models (illustration taken from Davila et al. (2025, Fig. 1)).

Table 1

13 tools chosen for the benchmark.

MODEL | TDS TOOL
Sampling | SMOTE (Cantalupo, 2021; Lemaître, Nogueira, and Aridas, 2017)
Bayesian Networks | PrivBayes (Zhang et al., 2017)
GAN | CTGAN (Xu et al., 2019), CTAB-GAN+ (Zhao et al., 2021), GANBLR++ (Zhang et al., 2022)
VAE | TVAE (Xu et al., 2019)
Diffusion (DPM) | TabDDPM (Kotelnikov et al., 2023)
Graph NN | GOGGLE (Liu et al., 2023)
Transformer | GReaT (Borisov et al., 2023), REaLTabFormer (Solatorio and Dupriez, 2023), TabuLa (Zhao, Birke, and Chen, 2025)
Hybrid | AutoDiff (Suh et al., 2023), TabSyn (Zhang et al., 2024)
Table 2
DATASET | COLUMN NUMBER | ROW NUMBER | CATEGORICAL COLUMNS | CONTINUOUS COLUMNS | MIXED COLUMNS | ML TASK
abalone (Nash et al., 1994) | 8 | 4177 | 1 | 7 | 0 | Regression
adult (Becker and Kohavi, 1996) | 14 | 48842 | 9 | 3 | 2 | Classification
airline (Banerjee, 2016) | 10 | 50000 | 8 | 2 | 0 | Regression
california (Nugent, n.d.) | 5 | 20433 | 1 | 4 | 0 | Regression
cardio (Janosi et al., 1989) | 12 | 70000 | 7 | 5 | 0 | Classification
churn2 (BlastChar, 2017) | 12 | 10000 | 5 | 6 | 1 | Classification
diabetes (Kahn, n.d.) | 9 | 768 | 2 | 7 | 0 | Classification
higgs-small (Whiteson, 2014) | 29 | 62751 | 1 | 28 | 0 | Classification
house (Torgo, 2014) | 16 | 22784 | 0 | 16 | 0 | Regression
insurance (Kumar, 2020) | 6 | 1338 | 3 | 3 | 0 | Regression
king (harlfoxem, 2016) | 19 | 21613 | 7 | 10 | 2 | Regression
loan (Quinlan, 1987) | 12 | 5000 | 6 | 6 | 0 | Classification
miniboone-small (Roe, 2005) | 51 | 50000 | 1 | 50 | 0 | Classification
payroll-small (City of Los Angeles, 2013) | 12 | 50000 | 4 | 8 | 0 | Regression
wilt (Johnson, 2013) | 6 | 4339 | 1 | 5 | 0 | Classification
Table 3
EVALUATION | CHALLENGE | EVALUATION FOCUS | METRICS
Dataset Imbalance | Ensuring that the tool captures the real column distributions even when the classes are imbalanced | Class distribution alignment | Continuous: Wasserstein Distance, KS Statistic, Correlation Differences. Categorical: Jensen–Shannon Divergence, KL Divergence, Percentage Class Count Difference
Data Augmentation | Guaranteeing that the generated synthetic data remains realistic and meaningful | Similarity and meaningful variability of new data points | Continuous: Wasserstein Distance, KS Statistic, Correlation Differences, Quantile Comparison. Categorical: Jensen–Shannon Divergence, Percentage Number of Classes Difference
Missing Values | Making certain the tools capture the key characteristics of the real dataset even when it contains different levels of missing values | Similarity to original distributions | Continuous: Wasserstein Distance, Quantile Comparison. Categorical: Jensen–Shannon Divergence, Percentage Class Count Difference, Percentage Number of Classes Difference
Privacy | Checking whether the tools generate truly synthetic data points rather than replicating the original data, which could expose sensitive information | Resemblance of synthetic records to real data, and anonymity levels | Distance to Closest Record (DCR), Nearest Neighbor Distance Ratio (NNDR)
Machine Learning Utility | Enabling effective ML training with synthetic data | Synthetic datasets used to train ML models for classification and regression tasks | Accuracy, Area Under the Receiver Operating Characteristic Curve (AUC), F1 Score (micro, macro, weighted), Explained Variance Score, Mean Absolute Error (MAE), and R2 Score
Performance | Ensuring synthetic data is generated within reasonable time frames while minimizing computational resource usage and maintaining scalability | Computational resource usage and time required for data generation | CPU Usage, GPU Usage, Memory Usage, Total Runtime
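
Most of the metrics listed above map onto standard SciPy and pandas routines. The following is a minimal sketch, not the benchmark's actual implementation, assuming two pandas DataFrames, real and synthetic, with identical columns:

import numpy as np
from scipy.stats import wasserstein_distance, ks_2samp, entropy
from scipy.spatial.distance import jensenshannon

def continuous_metrics(real_col, synth_col):
    # Wasserstein Distance and KS Statistic for one continuous column (pandas Series).
    wd = wasserstein_distance(real_col, synth_col)
    ks = ks_2samp(real_col, synth_col).statistic
    return wd, ks

def categorical_metrics(real_col, synth_col):
    # Jensen-Shannon and KL divergence between the class frequencies of one categorical column.
    categories = sorted(set(real_col) | set(synth_col))
    p = real_col.value_counts(normalize=True).reindex(categories, fill_value=0)
    q = synth_col.value_counts(normalize=True).reindex(categories, fill_value=0)
    js = jensenshannon(p, q) ** 2          # jensenshannon returns the distance; square it for the divergence
    kl = entropy(p + 1e-12, q + 1e-12)     # small epsilon avoids division by zero for unseen classes
    return js, kl

def correlation_difference(real, synthetic):
    # Mean absolute difference between the correlation matrices of the numeric columns.
    num = real.select_dtypes("number").columns
    return np.abs(real[num].corr() - synthetic[num].corr()).mean().mean()
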
Table 4

Normalization of datasets in the original experiments of different TDS tools.

TOOL | NORMALIZATION STRATEGY
PrivBayes | Only discrete columns, no normalization
CTGAN / TVAE | Mode-specific normalization applied to all X
GANBLR++ | Ordinal encoding for all columns; numerical treated as discrete
TabDDPM | Normalization of complete X (per code in data.py)
GOGGLE | No normalization (raw tensors from get_dataloader)
GReaT | No normalization; textual encoding of all columns
REaLTabFormer | Numerical columns normalized into fixed-length, digit-aligned string tokens
TabuLa | No normalization; continuous values directly as text tokens
AutoDiff | All numerical X normalized (Stasy: min–max; Tab: Gaussian quantile)
TabSyn | Z-score normalization of all numerical X
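
Three of the numeric rescalings in Table 4 (min–max, z-score, and Gaussian quantile transformation) correspond directly to scikit-learn preprocessors. The snippet below is an illustrative sketch with a placeholder matrix X; CTGAN/TVAE's mode-specific normalization additionally fits a Gaussian mixture per column and is not reproduced here.

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, QuantileTransformer

X = np.random.rand(1000, 4)  # placeholder numeric matrix (rows = records, columns = continuous features)

X_minmax = MinMaxScaler().fit_transform(X)       # min-max scaling (e.g., AutoDiff, Stasy variant)
X_zscore = StandardScaler().fit_transform(X)     # z-score normalization (e.g., TabSyn)
X_gauss = QuantileTransformer(output_distribution="normal",
                              n_quantiles=min(1000, len(X))).fit_transform(X)  # Gaussian quantile (e.g., AutoDiff, Tab variant)
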
Figure 2

Overview of the benchmark architecture, which automates configuration and analysis, ensuring reproducibility and consistency across tools and datasets.

Figure 3

Heatmap showing the correlation difference for various TDS tools and use cases, where zero indicates perfect preservation of inter-column correlations, and one represents the maximum difference. Empty cells denote cases where no dataset was generated due to intentional resource constraints.

Figure 4

Distribution comparison for Wilt, with one row per TDS tool. For continuous columns, real data is in blue and synthetic in red. For categorical columns, bars show class counts with real data in blue and synthetic in red. The five synthetic distributions occasionally overlap completely. GOGGLE collapsed and therefore shows exploding densities.

Table 5

Dataset imbalance evaluation for all tools, averaged across datasets and normalized using Min–Max scaling, highlighting the top-performing tools (SMOTE, REaLTabFormer, and TabSyn).

TOOL | WASSERSTEIN DISTANCE | KS STATISTIC | CORRELATION DIFFERENCE | JS DIVERGENCE | KL DIVERGENCE | PERCENTAGE COUNT DIFFERENCE
AutoDiff | 0.218 | 0.164 | 0.146 | 0.211 | 0.134 | 0.154
CTAB-GAN+ | 0.324 | 0.416 | 0.265 | 0.344 | 0.335 | 0.117
CTGAN | 0.286 | 0.482 | 0.308 | 0.382 | 0.411 | 0.528
GANBLR++ | 0.692 | 0.748 | 0.542 | 0.714 | 0.659 | 0.489
GReaT | 0.576 | 0.674 | 0.209 | 0.653 | 0.189 | 0.246
REaLTabFormer | 0.113 | 0.294 | 0.041 | 0.185 | 0.138 | 0.002
SMOTE | 0.063 | 0.062 | 0.176 | 0.015 | 0.019 | 0.198
TabDDPM | 0.344 | 0.325 | 0.053 | 0.294 | 0.276 | 0.861
TabSyn | 0.040 | 0.128 | 0.324 | 0.094 | 0.109 | 0.176
TabuLaMiddle | 0.513 | 0.641 | 0.183 | 0.588 | 0.201 | 0.233
TVAE | 0.317 | 0.543 | 0.327 | 0.423 | 0.346 | 0.414
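
The scores in Tables 5 to 8 are obtained by first averaging each metric per tool across all datasets and then min–max scaling each metric column. A minimal sketch of that post-processing, assuming a hypothetical DataFrame raw indexed by (tool, dataset) with one column per metric:

import pandas as pd

def normalize_scores(raw: pd.DataFrame) -> pd.DataFrame:
    # Average across datasets, then rescale every metric column to [0, 1].
    per_tool = raw.groupby(level="tool").mean()
    return (per_tool - per_tool.min()) / (per_tool.max() - per_tool.min())
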
Table 6

Augmentation evaluation results for all tools, averaged across datasets and normalized using Min–Max scaling. The top-performing tools are TabSyn, SMOTE, REaLTabFormer, and TabDDPM.

TOOL | WASSERSTEIN DISTANCE | KS STATISTIC | CORRELATION DIFFERENCE | QUANTILE COMPARISON | JS DIVERGENCE | PERCENTAGE NUMBER CLASSES DIFFERENCE
AutoDiff | 0.218 | 0.164 | 0.146 | 0.197 | 0.211 | 0.028
CTAB-GAN+ | 0.324 | 0.416 | 0.265 | 0.287 | 0.344 | 0.045
CTGAN | 0.286 | 0.416 | 0.265 | 0.226 | 0.382 | 0.014
GANBLR++ | 0.692 | 0.748 | 0.542 | 0.670 | 0.714 | 0.096
GReaT | 0.576 | 0.674 | 0.209 | 0.543 | 0.653 | 0.245
REaLTabFormer | 0.113 | 0.294 | 0.041 | 0.092 | 0.185 | 0.021
SMOTE | 0.063 | 0.062 | 0.176 | 0.071 | 0.015 | 0.126
TabDDPM | 0.344 | 0.325 | 0.053 | 0.259 | 0.294 | 0.006
TabSyn | 0.040 | 0.128 | 0.324 | 0.066 | 0.094 | 0.123
TabuLaMiddle | 0.513 | 0.641 | 0.183 | 0.465 | 0.588 | 0.332
TVAE | 0.317 | 0.543 | 0.327 | 0.287 | 0.423 | 0.069
Figure 5

Distribution comparison for the Diabetes dataset with 5%, 10%, and 20% missing values. The real data distribution is plotted in blue, and the synthetic data in red, green, and orange. Continuous columns are shown as density plots and categorical columns as bar plots.
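
The degraded training sets behind Figure 5 and Table 7 contain 5%, 10%, and 20% missing values. A minimal sketch of injecting missing cells completely at random (the paper's exact missingness mechanism may differ):

import numpy as np
import pandas as pd

def inject_missing(df: pd.DataFrame, rate: float, seed: int = 0) -> pd.DataFrame:
    # Mask a fraction `rate` of all cells uniformly at random.
    rng = np.random.default_rng(seed)
    mask = rng.random(df.shape) < rate
    return df.mask(mask)

# Example: three degraded copies of a dataset loaded into `diabetes_df` (placeholder name)
# degraded = {rate: inject_missing(diabetes_df, rate) for rate in (0.05, 0.10, 0.20)}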

Table 7

Missing-values evaluation results normalized using Min–Max scaling.

TOOL | WASSERSTEIN DISTANCE | KS STATISTIC | CORRELATION DIFFERENCE | QUANTILE COMPARISON | JS DIVERGENCE | KL DIVERGENCE | PERCENTAGE COUNT DIFFERENCE | PERCENTAGE NUM CLASSES DIFFERENCE
(a) 5% Missing Values
AutoDiff | 0.249 | 0.178 | 0.621 | 0.194 | 0.399 | 0.051 | 0.089 | 0.000
CTAB-GAN+ | 0.412 | 0.412 | 0.217 | 0.408 | 0.420 | 0.545 | 0.019 | 0.093
CTGAN | 0.829 | 0.930 | 0.669 | 0.855 | 0.846 | 0.879 | 0.722 | 0.061
GReaT | 0.744 | 0.681 | 0.578 | 0.731 | 0.843 | 0.091 | 0.498 | 0.138
REaLTabFormer | 0.251 | 0.402 | 0.121 | 0.278 | 0.460 | 0.560 | 0.004 | 0.147
SMOTE | 0.061 | 0.072 | 0.182 | 0.062 | 0.048 | 0.051 | 0.376 | 0.000
TabSyn | 0.074 | 0.112 | 0.418 | 0.052 | 0.123 | 0.030 | 0.045 | 1.000
TabuLaMiddle | 0.362 | 0.438 | 0.091 | 0.401 | 0.531 | 0.018 | 0.452 | 0.055
TVAE | 0.830 | 0.879 | 0.378 | 0.835 | 0.842 | 0.422 | 0.478 | 0.186
(b) 10% Missing Values
AutoDiff | 0.368 | 0.182 | 0.642 | 0.319 | 0.415 | 0.074 | 0.069 | 0.000
CTAB-GAN+ | 0.798 | 0.605 | 0.471 | 0.799 | 0.635 | 0.669 | 0.147 | 0.092
CTGAN | 0.725 | 0.963 | 0.523 | 0.753 | 0.872 | 0.875 | 0.734 | 0.034
GReaT | 0.753 | 0.711 | 0.583 | 0.735 | 0.887 | 0.077 | 0.495 | 0.167
REaLTabFormer | 0.269 | 0.436 | 0.108 | 0.263 | 0.469 | 0.531 | 0.004 | 0.165
SMOTE | 0.067 | 0.065 | 0.181 | 0.029 | 0.043 | 0.047 | 0.368 | 0.000
TabSyn | 0.072 | 0.092 | 0.000 | 0.038 | 0.067 | 0.075 | 0.042 | 0.000
TabuLaMiddle | 0.348 | 0.398 | 0.098 | 0.382 | 0.384 | 0.015 | 0.437 | 0.097
TVAE | 0.735 | 0.984 | 0.537 | 0.789 | 0.804 | 0.399 | 0.459 | 0.261
(c) 20% Missing Values
AutoDiff | 0.405 | 0.184 | 0.589 | 0.385 | 0.386 | 0.058 | 0.081 | 0.000
CTAB-GAN+ | 0.378 | 0.442 | 0.288 | 0.419 | 0.420 | 0.356 | 0.031 | 0.092
CTGAN | 0.911 | 1.000 | 0.662 | 0.976 | 0.925 | 0.904 | 0.732 | 0.042
GReaT | 0.769 | 0.678 | 0.563 | 0.782 | 0.863 | 0.071 | 0.501 | 0.129
REaLTabFormer | 0.267 | 0.435 | 0.128 | 0.336 | 0.492 | 0.433 | 0.005 | 0.138
SMOTE | 0.049 | 0.082 | 0.162 | 0.071 | 0.051 | 0.049 | 0.378 | 0.000
TabSyn | 0.502 | 0.972 | 0.231 | 0.533 | 0.835 | 0.061 | 0.021 | 0.033
TabuLaMiddle | 0.334 | 0.411 | 0.116 | 0.406 | 0.466 | 0.520 | 0.451 | 0.057
TVAE | 0.647 | 0.893 | 0.303 | 0.658 | 0.750 | 0.311 | 0.479 | 0.148
Table 8

Privacy results for all tools, averaged across datasets and normalized using Min–Max scaling, highlighting the top-performing tools (CTAB-GAN+, REaLTabFormer, and TabDDPM). The best score is one, and the worst is zero.

TOOL | DISTANCE TO CLOSEST RECORD | NEAREST NEIGHBOR DISTANCE RATIO
AutoDiff | 0.162 | 0.847
CTAB-GAN+ | 0.139 | 0.870
CTGAN | 0.230 | 0.844
GReaT | 0.046 | 0.751
REaLTabFormer | 0.283 | 0.781
SMOTE | 0.032 | 0.342
TabSyn | 0.261 | 0.750
TabDDPM | 0.329 | 0.810
TabuLaMiddle | 0.062 | 0.438
TVAE | 0.101 | 0.837
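
Both privacy metrics in Table 8 are nearest-neighbor statistics computed from synthetic records to the real data. A hedged sketch with scikit-learn, assuming numerically encoded arrays (the paper's preprocessing and aggregation may differ):

import numpy as np
from sklearn.neighbors import NearestNeighbors

def privacy_metrics(real: np.ndarray, synthetic: np.ndarray):
    # Distances from each synthetic record to its two closest real records.
    nn = NearestNeighbors(n_neighbors=2).fit(real)
    dist, _ = nn.kneighbors(synthetic)
    dcr = dist[:, 0].mean()                             # Distance to Closest Record
    nndr = (dist[:, 0] / (dist[:, 1] + 1e-12)).mean()   # Nearest Neighbor Distance Ratio
    return dcr, nndr
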
Table 9

Classifiers and regressors used for the ML utility evaluation.

CLASSIFICATION METHODS | REGRESSION METHODS
Decision Trees | Bayesian Ridge Regression
Gaussian Naive Bayes (NB) | Lasso Regression
K-Nearest Neighbors (KNN) | Linear Regression
Linear Support Vector Machine (SVM) | Ridge Regression
Logistic Regression | –
Multilayer Perceptron (MLP) | –
Perceptron | –
Random Forest | –
Radial Support Vector Machine (SVM) | –
Table 10

ML utility results showing the difference between a model's performance when trained with real datasets and when trained with synthetic datasets. For all metrics, a negative difference means the model performed better with synthetic data than with real data, and a positive difference means it performed worse.

TOOL | ACCURACY | AUC | F1 MICRO | F1 MACRO | F1 WEIGHTED | EVS | INVERSE MAE | R2 SCORE
AutoDiff | –0.072 | 0.010 | –0.070 | –0.009 | –0.080 | 0.038 | 0.009 | 0.062
CTAB-GAN+ | –0.052 | 0.074 | –0.046 | 0.071 | 0.053 | 0.176 | 0.145 | 0.172
CTGAN | 0.089 | 0.046 | 0.081 | 0.053 | 0.093 | 0.193 | 0.472 | 0.192
GANBLR++ | 0.118 | 0.152 | 0.121 | 0.156 | 0.137 | 0.548 | 0.342 | 0.546
GReaT | 0.043 | 0.051 | 0.040 | 0.070 | 0.033 | 0.395 | 0.398 | 0.665
REaLTabFormer | –0.024 | –0.029 | –0.020 | –0.028 | –0.023 | 0.003 | –0.026 | 0.003
SMOTE | –0.011 | –0.014 | –0.013 | –0.017 | –0.016 | 0.143 | 0.028 | 0.145
TabDDPM | 0.019 | –0.031 | 0.017 | –0.023 | 0.010 | 0.034 | 0.182 | 0.035
TabSyn | –0.046 | –0.041 | –0.043 | –0.040 | –0.078 | 0.018 | –0.035 | 0.017
TabuLaMiddle | –0.063 | –0.072 | –0.059 | –0.052 | –0.080 | 0.204 | 0.066 | 0.011
TVAE | –0.024 | –0.015 | –0.021 | –0.006 | –0.025 | 0.210 | 0.159 | 0.346
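
Each cell in Table 10 is a difference of the form metric(model trained on real data) minus metric(model trained on synthetic data), evaluated on the same real test split, so negative values favor synthetic training. A minimal sketch for one classifier and one metric (model choice and split names are placeholders):

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

def utility_difference(X_train_real, y_train_real, X_synth, y_synth, X_test, y_test):
    # Train identical models on real and on synthetic data, score both on the real test set.
    real_model = RandomForestClassifier(random_state=0).fit(X_train_real, y_train_real)
    synth_model = RandomForestClassifier(random_state=0).fit(X_synth, y_synth)
    acc_real = accuracy_score(y_test, real_model.predict(X_test))
    acc_synth = accuracy_score(y_test, synth_model.predict(X_test))
    return acc_real - acc_synth   # negative => the synthetic-trained model performed better
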
Table 11

Performance results for all tools, averaged across datasets. The top-performing tools are SMOTE and PrivBayes; REaLTabFormer shows the longest runtime and high resource usage.

TOOL | MEAN CPU (%) | MAX CPU (%) | MEAN MEMORY (%) | MAX MEMORY (%) | MEAN GPU (%) | MAX GPU (%) | RUNTIME (S)
AutoDiff | 14 | 24 | 15 | 16 | 73 | 73 | 240
CTAB-GAN+ | 18 | 28 | 7 | 8 | 34 | 56 | 863
CTGAN | 13 | 59 | 13 | 25 | 32 | 36 | 572
GANBLR++ | 27 | 88 | 16 | 19 | 0 | 0 | 1790
GOGGLE | 45 | 56 | 23 | 24 | 0 | 0 | 1106
GReaT | 9 | 22 | 13 | 14 | 77 | 86 | 3714
PrivBayes | 17 | 32 | 18 | 19 | 0 | 0 | 106
REaLTabFormer | 10 | 37 | 21 | 39 | 28 | 77 | 7682
SMOTE | 25 | 38 | 17 | 17 | 0 | 0 | 248
TabDDPM | 11 | 36 | 19 | 20 | 20 | 31 | 666
TabSyn | 17 | 25 | 19 | 22 | 13 | 57 | 317
TabuLaMiddle | 25 | 33 | 23 | 23 | 76 | 90 | 1097
TVAE | 12 | 90 | 17 | 22 | 13 | 15 | 465
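
The CPU, memory, and runtime columns in Table 11 can be collected with ordinary process monitoring. The sketch below samples system-wide CPU and memory with psutil in a background thread while a placeholder generate() call runs; GPU sampling (e.g., via gpustat, which several tools already list as a dependency) is omitted, and this is not the benchmark's actual instrumentation.

import threading
import time
import psutil

def monitored_run(generate, interval: float = 1.0):
    # Run `generate()` while periodically sampling CPU and memory utilization.
    samples, stop = [], threading.Event()

    def sampler():
        while not stop.is_set():
            samples.append((psutil.cpu_percent(interval=None),
                            psutil.virtual_memory().percent))
            time.sleep(interval)

    thread = threading.Thread(target=sampler, daemon=True)
    start = time.time()
    thread.start()
    generate()                      # placeholder for a TDS tool's training + sampling call
    stop.set()
    thread.join()
    runtime = time.time() - start

    cpu = [c for c, _ in samples] or [0.0]
    mem = [m for _, m in samples] or [0.0]
    return {"mean_cpu": sum(cpu) / len(cpu), "max_cpu": max(cpu),
            "mean_mem": sum(mem) / len(mem), "max_mem": max(mem),
            "runtime_s": runtime}
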
Figure 6

Spider web diagram summarizing model performance across the six benchmark dimensions. The scores are calculated by aggregating each dimension and are scaled 1–10.

Table 12

TDS tools and required packages.

TDS TOOL | PACKAGE REQUIREMENTS
SMOTE | pandas==2.2.2, numpy==2.0.0, scikit-learn==1.5.2, imbalanced-learn==0.13.0
PrivBayes | diffprivlib==0.6.3, dill==0.3.7, dython==0.6.8, joblib==1.2.0, lifelines==0.27.8, matplotlib==3.7.2, numpy==1.26.0, pandas==1.3.4, pyjanitor==0.26.0, pandas_flavor==0.6.0, scikit_learn==1.3.0, scipy==1.11.3, seaborn==0.13.0, thomas_core==0.1.3, synthetic-data-generation, torch, gpustat
CTGAN | tqdm==4.66.5, torch==2.1.0, numpy==2.0.0, pandas==2.2.2, scikit-learn==1.5.2, ctgan, joblib==1.4.2, rdt==1.7.0
CTAB-GAN+ | numpy==1.21.0, torch==1.10.0+cu113, torchvision==0.11.1+cu113, torchaudio==0.10.0+cu113, pandas==1.2.1, scikit-learn==0.24.1, dython==0.6.4.post1, scipy, gpustat, tqdm, -f https://download.pytorch.org/whl/torch_stable.html
GANBLR++ | ganblr
TVAE | tqdm==4.66.5, torch==2.1.0, numpy==2.0.0, pandas==2.2.2, scikit-learn==1.5.2, ctgan, joblib==1.4.2, rdt==1.7.0
TabDDPM | catboost==1.0.3, category-encoders==2.3.0, dython==0.5.1, icecream==2.1.2, libzero==0.0.8, numpy==1.21.4, optuna==2.10.1, pandas==1.3.4, pyarrow==6.0.0, rtdl==0.0.9, scikit-learn==1.0.2, scipy==1.7.2, skorch==0.11.0, tomli-w==0.4.0, tomli==1.2.2, tqdm==4.62.3
GOGGLE | chardet==5.1.0, cvxpy==1.1, dgl==0.9.0, geomloss==0.2.5, matplotlib==3.7.0, numpy==1.23.0, packaging==21.3, pandas==1.4.3, pgmpy==0.1.21, scikit-learn==1.1.1, seaborn==0.12.2, synthcity==0.2.2, torch==1.12.0, torch-geometric==2.2.0, torch-sparse==0.6.16, torch_scatter==2.1.0
GReaT | datasets≥2.5.2, numpy≥1.23.1, pandas≥1.4.4, scikit_learn≥1.1.1, torch≥1.10.2, tqdm≥4.64.1, transformers≥4.22.1, accelerate≥0.20.1
REaLTabFormer | torch, bandit≥1.6.2,<2.0, black~=22.0, build~=0.9.0, import-linter[toml]==1.2.6, openpyxl~=3.0.10, pre-commit≥2.9.2,<3.0, pylint≥2.5.2,<3.0, pytest-cov~=3.0, pytest-mock≥1.7.1,<2.0, pytest-xdist[psutil]~=2.2.1, pytest~=6.2, trufflehog~=2.1, twine~=4.0.1, pandas, datasets, scikit-learn, transformers, realtabformer
TabuLa | datasets≥2.5.2, numpy≥1.24.2, pandas≥1.4.4, scikit_learn≥1.1.1, torch≥1.10.2, tqdm≥4.64.1, transformers≥4.22.1
AutoDiff | numpy==2.0.0, pandas==2.2.2, scikit-learn==1.5.2, scipy==1.10.1, torch==2.1.0, gpustat==1.0.0, psutil==5.9.4, tqdm==4.65.0, ipywidgets==7.8.5, jupyter==1.0.0, matplotlib==3.7.1
TabSyn | numpy==2.0.0, pandas==2.2.2, scikit-learn==1.5.2, scipy==1.10.1, torch==2.1.0, icecream==2.1.2, category_encoders==2.3.0, imbalanced-learn==0.14.0, transformers==4.25.0, datasets==2.8.0, openpyxl==3.1.2, xgboost==1.7.5
Language: English
Submitted on: Jul 7, 2025
Accepted on: Nov 19, 2025
Published on: Dec 9, 2025
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2025 Maria Fernanda Davila Restrepo, Benjamin Wollmer, Fabian Panse, Wolfram Wingerath, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.