Application of machine learning in diabetes diagnosis

Aleksandra Rosińska; Łukasz Smaga

doi:10.2478/bile-2025-0008

Abstract

Diabetes is a serious global health problem that affects millions of people and leads to many complications if not diagnosed early. Early and accurate diagnosis is very important for improving patient outcomes and reducing healthcare costs. Machine learning can help to analyze medical data and predict diabetes more effectively. This study compares three machine learning models – logistic regression, random forests, and XGBoost – for predicting diabetes based on medical data. The models were tested in their basic forms and with different techniques for balancing the dataset, such as undersampling, oversampling, SMOTE, and an asymmetric approach. Additionally, variable reduction and probability averaging as a form of ensemble learning were applied. The experiments are based on the dataset available on the Kaggle platform, which contains 100,000 observations. The problem is interesting because diagnostic criteria based on glycated hemoglobin and blood glucose levels do not enable automatic and unambiguous diagnosis in this dataset. However, they will be important independent variables in the classification models considered. The results of the evaluation show the potential of machine learning in supporting specialists in diabetes diagnosis, and highlight the importance of proper data preprocessing for achieving better model performance.
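Two of the techniques named in the abstract, undersampling and probability averaging, can be illustrated with a minimal sketch. The abstract gives no implementation details, so the Python code below is an assumption-laden illustration rather than the authors' method: the function names (`random_undersample`, `average_probabilities`, `classify`), the 0.5 threshold, and all numbers are hypothetical, and the probabilities stand in for the outputs of three fitted models such as logistic regression, a random forest, and XGBoost.

```python
import random


def random_undersample(rows, labels, seed=0):
    """Randomly drop majority-class rows until both classes are the same size.

    Assumes binary labels coded as 0 and 1 (hypothetical coding).
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    # Keep every minority row and an equally sized random sample of majority rows.
    kept = sorted(minority + rng.sample(majority, len(minority)))
    return [rows[i] for i in kept], [labels[i] for i in kept]


def average_probabilities(*model_probs):
    """Average the predicted probabilities of several models, observation by observation."""
    n_models = len(model_probs)
    return [sum(ps) / n_models for ps in zip(*model_probs)]


def classify(probs, threshold=0.5):
    """Turn averaged probabilities into 0/1 predictions (threshold is an assumption)."""
    return [1 if p >= threshold else 0 for p in probs]


# Hypothetical toy data: 2 positive and 8 negative observations.
rows = [[i] for i in range(10)]
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
balanced_rows, balanced_labels = random_undersample(rows, labels)

# Hypothetical predicted probabilities from three models for two patients.
averaged = average_probabilities([0.2, 0.8], [0.4, 0.6], [0.6, 0.7])
predictions = classify(averaged)
```

In this sketch the balanced sample keeps both positive observations plus two randomly chosen negative ones, and the averaged probabilities are thresholded to give a single ensemble prediction per patient. Oversampling and SMOTE work in the opposite direction by adding minority-class observations instead of removing majority-class ones.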

