Abstract
Aim
The aim of the study was to train, evaluate, and optimize machine learning models for classifying focal lesions in the thyroid gland as benign or malignant based on their features.
Material and methods
A dataset of 841 focal thyroid lesions described by 17 features (ultrasonographic and patient characteristics) was considered. Statistical and then exploratory data analyses were conducted in Python, including the generation of plots and heat maps of correlations between the considered features. Binary classification models were selected to assign each focal lesion, on the basis of its characteristics, to one of two classes (benign or malignant). The following models were used: logistic regression, support vector machine, k-nearest neighbors, Random Forest, and decision tree classifiers. Feature selection methods were applied to identify the focal lesion features that contributed most to the models' classification decisions. The final dataset consisted of 841 focal thyroid lesions described by seven ultrasonographic features and a histopathological assessment of malignancy (benign or malignant). Classifiers were validated using 10-fold cross-validation. Model performance was evaluated using sensitivity, specificity, accuracy, F1 score, precision, area under the ROC curve, positive predictive value (PPV), and negative predictive value (NPV).
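The validation scheme described above can be illustrated with a minimal sketch using scikit-learn. Several things here are assumptions not stated in the abstract: the data are synthetic (generated with make_classification to mimic 841 lesions with seven features), the folds are stratified and shuffled, features are standardized for the distance- and margin-based models, and all hyperparameters are library defaults. The sketch is not the study's actual pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real data (assumption): 841 samples,
# 7 features, roughly 30% positive ("malignant") cases.
X, y = make_classification(
    n_samples=841, n_features=7, n_informative=5, n_redundant=2,
    weights=[0.7, 0.3], random_state=0,
)

# The five model families named in the abstract, with default settings.
models = {
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)),
    "svm": make_pipeline(StandardScaler(), SVC(random_state=0)),
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "decision_tree": DecisionTreeClassifier(random_state=0),
}

# 10-fold cross-validation; stratification is an assumption.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Sensitivity is recall of the positive class.
scoring = {
    "sensitivity": "recall",
    "accuracy": "accuracy",
    "f1": "f1",
    "precision": "precision",
    "roc_auc": "roc_auc",
}

results = {}
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=cv, scoring=scoring)
    # Average each metric over the 10 held-out folds.
    results[name] = {m: scores[f"test_{m}"].mean() for m in scoring}

# Rank the models by mean sensitivity, the criterion used in the abstract.
for name, r in sorted(results.items(), key=lambda kv: -kv[1]["sensitivity"]):
    print(name, {m: round(v, 3) for m, v in r.items()})
```

Wrapping the scaler and the classifier in one pipeline means the scaling parameters are re-estimated on each training fold, which avoids leaking information from the held-out fold.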
Results
The best-performing model (in terms of sensitivity) was the classifier based on a support vector machine: sensitivity = 71.17%, accuracy = 83.24%, area under the ROC curve = 84.86%, F1 score = 69.13%, precision = 68.85%, PPV = 68.49%, NPV = 89.06%.
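For readers less familiar with these measures, the following sketch shows how they are conventionally derived from the four cells of a binary confusion matrix. The function name binary_metrics and the counts in the usage example are hypothetical and are not taken from the study. Note that, when computed from a single confusion matrix, precision and PPV share the same formula.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute standard binary classification metrics from confusion-matrix counts.

    tp/fn: malignant lesions classified as malignant/benign.
    tn/fp: benign lesions classified as benign/malignant.
    """
    sensitivity = tp / (tp + fn)   # share of malignant lesions detected
    specificity = tn / (tn + fp)   # share of benign lesions recognized
    ppv = tp / (tp + fp)           # positive predictive value (precision)
    npv = tn / (tn + fn)           # negative predictive value
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)  # harmonic mean
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "accuracy": accuracy,
        "f1": f1,
    }


# Hypothetical counts chosen only for illustration (not study data).
metrics = binary_metrics(tp=70, fp=30, tn=170, fn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```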
Conclusions
The study demonstrates the usefulness of data science methods in predicting the malignancy of focal thyroid lesions. It shows that the classification decisions made by the studied models are based on specific ultrasonographic features associated with an increased or reduced risk of malignancy.