Abstract
Diabetes is a serious global health problem that affects millions of people and leads to many complications if not diagnosed early. Early and accurate diagnosis is essential for improving patient outcomes and reducing healthcare costs. Machine learning can help analyze medical data and predict diabetes more effectively. This study compares three machine learning models (logistic regression, random forest, and XGBoost) for predicting diabetes from medical data. The models were tested in their basic forms and in combination with different techniques for balancing the dataset, such as undersampling, oversampling, SMOTE, and an asymmetric approach. Additionally, variable reduction and probability averaging as a form of ensemble learning were applied. The experiments use a dataset available on the Kaggle platform that contains 100,000 observations. The problem is interesting because diagnostic criteria based on glycated hemoglobin and blood glucose levels do not enable automatic and unambiguous diagnosis in this dataset; these two measurements nevertheless serve as important independent variables in the classification models considered. The results show the potential of machine learning to support specialists in diabetes diagnosis and highlight the importance of proper data preprocessing for achieving better model performance.
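Two of the techniques named above, random undersampling and probability averaging, can be illustrated with a minimal sketch. This is not the study's implementation: the function names `undersample` and `average_probabilities`, the toy data, the per-model probabilities, and the 0.5 decision threshold are all assumptions made for illustration.

```python
import random


def undersample(rows, labels, seed=0):
    """Randomly drop majority-class rows until both classes have equal size."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    # Keep every minority row plus an equally sized random majority sample.
    kept = sorted(minority + rng.sample(majority, len(minority)))
    return [rows[i] for i in kept], [labels[i] for i in kept]


def average_probabilities(prob_lists):
    """Average the positive-class probabilities of several models per observation."""
    n_models = len(prob_lists)
    return [sum(ps) / n_models for ps in zip(*prob_lists)]


# Toy data (assumed): 8 negative and 2 positive observations.
labels = [0] * 8 + [1] * 2
rows = [[float(i)] for i in range(10)]
bal_rows, bal_labels = undersample(rows, labels)

# Hypothetical positive-class probabilities from three already-fitted models
# for three test observations.
p_lr = [0.2, 0.7, 0.6]
p_rf = [0.1, 0.9, 0.2]
p_xgb = [0.3, 0.8, 0.4]
avg = average_probabilities([p_lr, p_rf, p_xgb])

# Assumed decision threshold of 0.5 on the averaged probability.
preds = [int(p >= 0.5) for p in avg]
```

In this sketch the balanced sample keeps both positive rows and two randomly chosen negative rows, and the final class is decided from the mean probability rather than from any single model's output.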