In 2023, an estimated 97,610 individuals in the US received the daunting diagnosis of invasive malignant melanoma, slightly fewer than the 99,780 cases reported in 2022. Tragically, the number of lives lost to this disease increased from 7,650 in 2022 to an estimated 7,990 in 2023 [1–3]. Projections indicated that approximately 89,070 cases of early in situ melanoma would be identified in 2023. The silver lining is that when melanomas are detected and treated at this early stage, there is a chance for complete recovery. Since most melanomas manifest visibly on the skin, early detection and treatment are of crucial importance [4]. Skin cancer, increasingly prevalent today, affects humans widely because the skin is the body’s largest organ [5]. When facing skin cancer, it is important to be proactive and to use all available resources to detect it early and improve treatment outcomes. This type of cancer encompasses two main categories: melanoma and nonmelanoma skin cancer [6]. Melanoma, though it accounts for only about 1% of skin cancer cases, is perilous and often fatal, causing a disproportionately large share of skin cancer deaths. Originating from melanocytes, the cells responsible for skin pigmentation, melanoma can emerge anywhere on the body but commonly appears on sun-exposed areas such as the hands, face, and neck. Early detection is crucial for effective treatment; otherwise, melanoma can metastasize, with dire consequences. Various subtypes exist, including nodular melanoma, superficial spreading melanoma, acral lentiginous melanoma, and lentigo maligna melanoma. Conversely, nonmelanoma skin cancers such as basal cell carcinoma (BCC), squamous cell carcinoma (SCC), and sebaceous gland carcinoma (SGC) are more prevalent. These cancers typically arise in the outer layers of the skin and have a lower tendency to metastasize, making them relatively easier to treat than melanoma [6–12].
Fortunately, technology plays a crucial role in aiding dermatologists. Artificial intelligence acts as a reliable assistant, swiftly analyzing images of skin lesions to identify potential signs of cancer with greater speed and accuracy. Additionally, advanced imaging techniques such as reflectance confocal microscopy and optical coherence tomography provide valuable insights into skin abnormalities without invasive procedures [3,13–15]. In medical imaging, domain experts are directly involved in assembling datasets and in guiding machine learning researchers on which aspects to consider. Consider the HAM10000 dataset, for instance: a collection of images showing different skin lesions [17]. Since 2016, it has played a crucial role in the International Skin Imaging Collaboration (ISIC) challenge, assisting in addressing various skin-related issues [17–19]. Other datasets include PH2, featuring 200 dermoscopic images sorted into various diagnosis groups, and the Interactive Atlas of Dermoscopy, which offers over 1000 clinical cases with thorough annotations and pathology findings. These resources have proven invaluable in advancing computer-aided diagnosis (CAD) research [21].
Segmentation is a foundational and intricate task in the automated analysis of skin lesions. Within clinical contexts, rule-based diagnostic systems rely heavily on precise lesion segmentation to accurately estimate critical diagnostic parameters such as asymmetry, border irregularity, and lesion size. These metrics serve as fundamental components for the application of established algorithms such as the ABCD algorithm and its derivatives, ABCDE and ABCDEF.
Skin cancer detection methodologies and their respective datasets
| Reference | Methodology | Dataset(s) | Evaluation Metrics |
|---|---|---|---|
| [22] | Preprocessing images and fine-tuning convolutional neural networks with transfer learning, with EfficientNet B4 identified as the top-performing model. | HAM10000 dataset | F1-score: 87%, Accuracy: 87.91% |
| [23] | Automated Skin-Melanoma Detection (ASMD) system using image processing and SVM-based classification, proposing a Melanoma-Index (MI) for clinical use. | DD image dataset | Accuracy: 97.50% |
| [24] | Automatic skin cancer diagnosis system including Histogram of Gradients (HG) and Histogram of Lines (HL), combined with other features. | HPH dermoscopy database and the Dermofit standard database | Accuracy: 98.79% (HPH) and 92.96% (standard Dermofit) |
| [25] | Skin cancer detection system utilizing Genetic Programming (GP) for evolving a classifier and feature selection. | PH2 dataset | Accuracy: 97.92% |
| [26] | Image processing and deep learning techniques, including Convolutional Neural Networks (CNNs), for skin cancer detection and classification. | MNIST HAM10000 dataset | Weighted Average Accuracy: 0.88, Weighted Average Recall: 0.74, Weighted F1-score: 0.77 |
| [27] | Classification of skin lesions, utilizing dynamic-sized kernels and both ReLU and Leaky ReLU activation functions. | HAM10000 dataset | Overall accuracy: 97.85% |
| [28] | Soft-Attention mechanism in deep neural architectures for skin lesion classification. | HAM10000 dataset and ISIC-2017 dataset | Precision: 93.7% (HAM10000), Sensitivity: 91.6% (ISIC-2017) |
| [29] | MobileNetV3 introducing the Improved Artificial Rabbits Optimizer (IARO) algorithm to enhance feature selection. | PH2, ISIC-2016, and HAM10000 datasets | Accuracy: 87.17% (ISIC-2016), 96.79% (PH2), and 88.71% (HAM10000) |
| [30] | SkinTrans, an improved transformer network for skin cancer classification, utilizing vision transformers (ViT) with a self-attention mechanism. | HAM10000 and clinical datasets | Accuracy: 94.3% (HAM10000) and 94.1% (Clinical) |
The ABCD algorithm provides a structured framework for evaluating skin lesions, with each letter representing a key morphological feature: asymmetry (A), border irregularity (B), color variation (C), and diameter or, in the dermoscopic variant, differential structures (D); the ABCDE and ABCDEF derivatives extend this set with evolution (E) and additional criteria. Accurate lesion segmentation is imperative for enabling reliable and effective automated diagnostic assessments, thereby enhancing clinical decision-making processes. As such, robust segmentation methodologies are essential for ensuring the integrity and clinical utility of automated skin lesion analysis systems [30–33].
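As one concrete instantiation of such rule-based scoring, the dermoscopic ABCD rule of Stolz and colleagues combines the four sub-scores into a total dermoscopy score (TDS). The sketch below is a minimal illustration assuming the standard weights (1.3, 0.1, 0.5, 0.5) and the conventional decision thresholds; the sub-scores themselves would come from a prior segmentation and feature-extraction step, and the function names are hypothetical.

```python
def total_dermoscopy_score(asymmetry: int, border: int, colors: int, structures: int) -> float:
    """Dermoscopic ABCD rule (Stolz et al.): TDS = 1.3*A + 0.1*B + 0.5*C + 0.5*D.

    asymmetry:  0-2 (number of axes of asymmetry)
    border:     0-8 (number of border segments with an abrupt cut-off)
    colors:     1-6 (number of distinct colours present)
    structures: 1-5 (number of dermoscopic structures present)
    """
    return 1.3 * asymmetry + 0.1 * border + 0.5 * colors + 0.5 * structures


def interpret_tds(tds: float) -> str:
    # Conventional thresholds: < 4.75 benign, 4.75-5.45 suspicious, > 5.45 suggestive of melanoma.
    if tds < 4.75:
        return "likely benign"
    if tds <= 5.45:
        return "suspicious"
    return "suggestive of melanoma"


# Example: an asymmetric lesion with an irregular border and several colours.
print(interpret_tds(total_dermoscopy_score(asymmetry=2, border=5, colors=4, structures=3)))
```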
In academic research, three-field plots (Figure 1) and co-occurrence networks (Figure 2) serve as invaluable tools for understanding the intricate web of connections within scholarly literature (93 reviews). These visualizations provide insights into how keywords, university affiliations, and authors’ countries of origin intersect, offering a snapshot of how ideas move and evolve within academia. They can be thought of as architectural schematics that reveal the complex interrelations among these elements. In the three-field plot, the height of each box signifies the volume of publications associated with the corresponding affiliation, offering a straightforward gauge of research output: a taller box indicates a higher number of publications from that affiliation, a key metric for evaluating research productivity.

Three-field plot analysis (AU_UN—ID—AU_CO)

Co-occurrence network of author keywords
In research publications, a graphical representation called the co-occurrence network of author keywords (see Figure 2) shows the connections between different author keywords. Individual keywords are nodes in these networks, and the edges connecting the nodes indicate how often those keywords co-occur in the same research documents. Across a research field or a specific set of publications, these networks highlight essential information by presenting a compact picture of the underlying relationships. Upon closer examination of Figure 1, Germany, the United States, and Australia emerge as the forerunners in the academic work charting the progress of artificial-intelligence-driven melanoma research.
Figure 2 offers an analysis specifically tailored to keywords associated with artificial intelligence, providing a comprehensive breakdown across a total of 30 nodes.
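For illustration, a keyword co-occurrence network of this kind can be built in a few lines of Python. The sketch below uses the networkx library and a handful of hypothetical author-keyword lists, so the node and edge counts are purely illustrative.

```python
from itertools import combinations

import networkx as nx

# Hypothetical author-keyword lists, one per publication.
documents = [
    ["melanoma", "deep learning", "dermoscopy"],
    ["melanoma", "deep learning", "transfer learning"],
    ["skin cancer", "dermoscopy", "deep learning"],
]

G = nx.Graph()
for keywords in documents:
    for a, b in combinations(sorted(set(keywords)), 2):
        # Edge weight counts how many documents mention both keywords.
        if G.has_edge(a, b):
            G[a][b]["weight"] += 1
        else:
            G.add_edge(a, b, weight=1)

# Keywords are nodes; heavier edges mean more frequent co-occurrence.
print(G["deep learning"]["melanoma"]["weight"])  # -> 2
```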
In this study, the feature blocks of several pretrained Convolutional Neural Networks (CNNs) [35] are used as feature extractors for the image data and are compared according to the classification metrics of the full proposed architecture. This architecture consists of the feature-extractor block taken from the pretrained model that yields the best metrics, a classification block for the image data, a separate classification block for the tabular data, and a final classification block through which the combined image and tabular features pass.
ConvNeXt Base, an innovative CNN architecture, represents a significant advancement in the field of computer vision. While not as widely recognized as some mainstream architectures, ConvNeXt Base has garnered attention for its robust feature extraction capabilities and computational efficiency.
Notable for its utilization of grouped convolutions, layer normalization, and stochastic depth regularization, ConvNeXt Base aims to strike a delicate balance between model complexity and performance. In terms of performance, ConvNeXt Base has demonstrated promising results in various image classification tasks. For instance, researchers have employed ConvNeXt Base to classify complex medical images, such as those related to tumor detection in mammography and histopathological analysis. Its ability to capture intricate visual patterns while maintaining computational efficiency makes ConvNeXt Base a compelling choice for diverse applications. Architecturally, ConvNeXt Base comprises a series of CNBlocks, each featuring grouped convolutions, linear transformations, and permutations. These blocks, augmented with stochastic depth layers, facilitate enhanced model robustness and generalization capabilities. Furthermore, ConvNeXt Base leverages layer normalization and various activation functions to capture complex patterns effectively.
MobileNet V3 Large stands as a testament to the evolution of lightweight CNN architectures optimized for mobile and embedded devices. Built upon the success of its predecessors, MobileNet V3 Large prioritizes computational efficiency without sacrificing classification accuracy. Its architectural design, characterized by depth-wise separable convolutions, inverted residual blocks, and squeeze-and-excitation modules, enables efficient feature extraction across diverse datasets.
In practical applications, MobileNet V3 Large has found widespread adoption, particularly in scenarios where resource constraints pose challenges. For example, it has been instrumental in various mobile applications, ranging from real-time object detection to image classification in low-power devices. The incorporation of hard-swish activation functions and dropout layers further enhances its performance and regularization capabilities. Architecturally, MobileNet V3 Large comprises a hierarchical structure of depth-wise separable convolutions and inverted residual blocks, facilitating efficient feature extraction and spatial downsampling. Additionally, squeeze-and-excitation modules enable adaptive feature recalibration, enhancing the discriminative power of the model across different input domains.
VGG16, a seminal CNN architecture, has left an indelible mark on the landscape of image classification. Renowned for its simplicity and effectiveness, VGG16 remains a cornerstone in the field despite the emergence of newer architectures. Its architectural design, characterized by repeated blocks of convolutional layers followed by rectified linear unit (ReLU) activation functions and max-pooling operations, prioritizes feature extraction and spatial downsampling.
In practice, VGG16 has been extensively utilized in various computer vision tasks, ranging from image recognition to object localization [6]. Its straight-forward design and strong performance make it a popular choice for benchmarking and experimentation in research settings. Architecturally, VGG16 comprises a hierarchical structure of convolutional layers, augmented by ReLU activation functions and maxpooling operations. This design fosters hierarchical feature extraction, enabling the model to capture increasingly abstract representations as information traverses deeper into the network. Additionally, the incorporation of max-pooling layers facilitates spatial downsampling, reducing computational complexity while preserving discriminative information.
EfficientNet V2 S emerges as a pinnacle in the realm of scalable and efficient CNN architectures, embodying state-of-the-art advancements in model design and optimization. With a focus on achieving optimal performance across varying computational budgets, EfficientNet V2 S has garnered widespread acclaim for its adaptability to diverse deployment scenarios. Its architectural design, characterized by a standard convolutional layer followed by fused MobileNetV2 (MBConv) blocks, epitomizes efficiency without compromising classification accuracy.
In practical applications, EfficientNet V2 S has demonstrated remarkable efficacy across a spectrum of tasks, from image classification to object detection. Its hierarchical structure and strategic incorporation of stochastic depth layers contribute to enhanced model robustness and generalization capabilities. Architecturally, EfficientNet V2 S comprises a cascade of MBConv blocks, each featuring depth-wise separable convolutions and efficient channel attention mechanisms. This design fosters efficient feature extraction and aggregation, enabling the model to capture complex visual patterns while minimizing computational overhead. Additionally, the utilization of stochastic depth layers enhances model regularization, contributing to improved performance on diverse datasets.
DenseNet161 stands as a paradigmatic shift in CNN architectures, distinguished by its dense connectivity pattern that promotes feature reuse and gradient flow propagation throughout the network. Renowned for its superior performance in capturing intricate visual representations, DenseNet161 has become a cornerstone in image classification tasks, particularly in scenarios requiring robust feature extraction. In practical applications, DenseNet161 has demonstrated remarkable efficacy across various domains, including medical image analysis and remote sensing. Its dense connectivity pattern and hierarchical structure facilitate information flow, enabling the model to capture increasingly abstract representations as information traverses deeper into the network.
Architecturally, DenseNet161 comprises a series of dense blocks, each featuring densely connected layers that receive direct input from all preceding layers within the block. This design promotes feature reuse and gradient flow propagation, facilitating efficient training and improved model expressiveness. Additionally, transition layers and global average pooling operations further enhance feature aggregation and classification performance, making DenseNet161 a formidable contender in the realm of image classification architectures.
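To make the comparison of these backbones concrete, the sketch below shows one plausible way to load the five pretrained torchvision models and freeze their convolutional feature blocks for use as fixed feature extractors. The `weights=...DEFAULT` enums assume torchvision 0.13 or later, and the choice to freeze all backbone parameters is an assumption for illustration rather than the authors' documented configuration.

```python
import torch
from torchvision import models

# Load the pretrained backbones compared in this study; only their
# convolutional feature blocks are used as feature extractors.
backbones = {
    "mobilenet_v3_large": models.mobilenet_v3_large(weights=models.MobileNet_V3_Large_Weights.DEFAULT),
    "convnext_base": models.convnext_base(weights=models.ConvNeXt_Base_Weights.DEFAULT),
    "vgg16": models.vgg16(weights=models.VGG16_Weights.DEFAULT),
    "efficientnet_v2_s": models.efficientnet_v2_s(weights=models.EfficientNet_V2_S_Weights.DEFAULT),
    "densenet161": models.densenet161(weights=models.DenseNet161_Weights.DEFAULT),
}

extractors = {}
for name, model in backbones.items():
    feature_block = model.features           # convolutional feature block only
    for p in feature_block.parameters():
        p.requires_grad = False               # keep the pretrained weights fixed
    extractors[name] = feature_block.eval()

x = torch.randn(1, 3, 224, 224)               # one dummy 224x224 RGB image
with torch.no_grad():
    feats = extractors["mobilenet_v3_large"](x)
print(feats.shape)                             # e.g. torch.Size([1, 960, 7, 7])
```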
The primary dataset utilized in this study is the SIIM-ISIC Melanoma Classification Competition Dataset sourced from Kaggle. This dataset comprises a comprehensive collection of medical images and corresponding metadata essential for melanoma classification research. The images are primarily dermoscopic images of skin lesions, while the metadata includes critical patient information such as identifiers, sex, age, anatomical site, diagnosis, and malignancy status.
Upon acquisition, the dataset underwent extensive preprocessing to ensure uniformity and compatibility with the model architecture. The images were initially available in multiple formats, including DICOM, JPEG, and TFRecord. To facilitate ease of processing and analysis, all images were converted to a standard format, namely JPEG. Additionally, the images were resized to a uniform resolution of 224×224 pixels to ensure consistency across the dataset.
To enhance the diversity and robustness of the training dataset, various data augmentation techniques were applied. Leveraging the transforms module from the PyTorch library, augmentations such as random vertical and horizontal flips were performed on the images. These augmentations help the model generalize better by introducing variations in the training data, thereby reducing overfitting tendencies.
Furthermore, normalization was applied to the image data to standardize the pixel values. This normalization process involved scaling the pixel values to a standard range, typically between 0 and 1, to stabilize the training process and accelerate convergence.
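A minimal sketch of a torchvision transform pipeline consistent with the preprocessing described above (resizing to 224×224, random horizontal and vertical flips, and conversion to tensors scaled to [0, 1]) is shown below. The flip probabilities are illustrative assumptions rather than the authors' exact settings.

```python
from torchvision import transforms

# Training transforms: resize, random flips for augmentation, tensor conversion.
train_transforms = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomVerticalFlip(p=0.5),
    transforms.ToTensor(),   # scales pixel values to the [0, 1] range
])

# Test transforms: no augmentation, only resizing and scaling.
test_transforms = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
```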
Convolutional Neural Networks (CNNs) have become integral in various domains, particularly in computer vision tasks, due to their efficacy in learning and extracting features from visual data. CNNs are adept at discerning intricate patterns and structures within images, making them suitable for tasks like image classification, object detection, and segmentation.
In this study, CNNs are utilized to tackle the crucial task of classifying benign and malignant skin lesions in dermatology. The objective is to leverage CNNs to identify underlying patterns and structures within skin lesion images accurately distinguishing between benign and malignant cases.
The dataset (Table 2) incorporates both tabular data, consisting of patient demographics (sex, approximate age) and lesion location, and image data with three channels and dimensions of 224×224 pixels. The image data undergoes a sequential process, starting with the convolutional layers, batch normalization layers, and Hardswish activation layers of the MobileNet V3 Large model's feature block, to extract pertinent features. It then passes through linear, batch normalization, and dropout layers to prepare it for classification. Meanwhile, the tabular data undergoes transformations via multiple linear and ReLU layers. The two sets of features are combined and processed through a linear layer. For classification, the combined output then passes through a sigmoid activation function and is rounded to yield the final prediction.
Number of tabular and image data samples taken from the SIIM-ISIC dataset
| Class | Total | Training | Testing |
|---|---|---|---|
| Malignant | 571 | 461 | 110 |
| Benign | 579 | 459 | 120 |
| Total | 1150 | 920 | 230 |
Incorporating MobileNetV3’s feature extraction mechanism, this model harnesses the efficiency and sophistication of modern CNN architectures. MobileNetV3’s feature extractor is pivotal in distilling intricate patterns from skin lesion images. Comprising multiple layers, as shown in Figure 3, MobileNetV3 begins with a convolutional layer, followed by batch normalization for stabilization and Hardswish activation for non-linearity. This is particularly effective in capturing diverse features across various skin lesion types. Subsequently, MobileNetV3’s architecture integrates inverted residual blocks, each composed of depthwise separable convolutions and pointwise convolutions. These structures enable the model to efficiently capture spatial hierarchies and semantic information within the image data.

Proposed model architecture
The image features extracted by MobileNetV3 are then further refined through the image classifier component. This classifier comprises multiple layers, starting with a flattening operation to convert the multi-dimensional feature maps into a one-dimensional tensor. Subsequently, the flattened features are processed through a sequence of linear transformations, ReLU activations, batch normalization for feature scaling, and dropout regularization to mitigate overfitting risks.
These operations collectively facilitate the learning of discriminative features essential for accurate lesion classification. Concurrently, the tabular data undergoes processing through a dedicated tabular classifier component. Similar to the image classifier, this component comprises multiple fully connected layers augmented with ReLU activation functions. These layers enable the model to capture intricate relationships within the tabular data, leveraging patient demographics and medical history for improved lesion classification accuracy.
Finally, the fused features from both the image and tabular classifiers are concatenated and passed through the fusion layer. This layer consists of a flattening operation followed by a linear transformation. While simpler in structure compared to the classifiers, the fusion layer synthesizes the extracted features from both modalities, facilitating a holistic analysis of the input data and ultimately contributing to enhanced classification performance. By integrating both image and tabular data sources and leveraging sophisticated feature extraction mechanisms and classification components, this model presents a comprehensive approach to skin lesion classification.
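A minimal PyTorch sketch of a multi-input model along the lines described above is given below: the MobileNet V3 Large feature block handles the images, a small MLP handles the tabular features (sex, approximate age, anatomical site), and a fusion layer produces a single logit that is passed through a sigmoid at inference time. The class name, hidden-layer sizes, and dropout rate are assumptions for illustration, not the authors' exact values.

```python
import torch
import torch.nn as nn
from torchvision import models


class MultiInputSkinLesionModel(nn.Module):
    def __init__(self, n_tabular_features: int = 3):
        super().__init__()
        # Pretrained MobileNet V3 Large feature block as the image feature extractor.
        backbone = models.mobilenet_v3_large(weights=models.MobileNet_V3_Large_Weights.DEFAULT)
        self.image_features = backbone.features            # (N, 960, 7, 7) for 224x224 inputs

        # Image classifier head: flatten, then linear / batch-norm / ReLU / dropout layers.
        self.image_classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(960 * 7 * 7, 256),
            nn.BatchNorm1d(256),
            nn.ReLU(),
            nn.Dropout(0.3),
        )

        # Tabular classifier: linear + ReLU layers over sex, age, and lesion location.
        self.tabular_classifier = nn.Sequential(
            nn.Linear(n_tabular_features, 32),
            nn.ReLU(),
            nn.Linear(32, 32),
            nn.ReLU(),
        )

        # Fusion layer: concatenated image and tabular features mapped to one logit.
        self.fusion = nn.Linear(256 + 32, 1)

    def forward(self, image: torch.Tensor, tabular: torch.Tensor) -> torch.Tensor:
        img = self.image_classifier(self.image_features(image))
        tab = self.tabular_classifier(tabular)
        return self.fusion(torch.cat([img, tab], dim=1))    # raw logit for BCEWithLogitsLoss


model = MultiInputSkinLesionModel()
logit = model(torch.randn(2, 3, 224, 224), torch.randn(2, 3))
prediction = torch.sigmoid(logit).round()                   # 0 = benign, 1 = malignant
```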
For conducting the experiments, a computational setup consisting of an Intel Core i9 processor and an NVIDIA GeForce RTX GPU was utilized. The experiments were executed within the VS Code Jupyter Notebook interface, leveraging the PyTorch framework for model implementation. The training process spanned 35 epochs and took approximately 28 minutes to complete on the aforementioned hardware configuration. The architecture of the proposed methodology is shown in Figure 4.

Proposed methodology
The experimental setup encompassed the evaluation of multiple neural network architectures. Loss and accuracy curves were generated for each architecture, as depicted in the following figures.
In the study of skin lesion classification, the experimental process begins with the collection of two distinct types of data: image data, consisting of skin lesion images, and tabular data, encompassing patient demographics and clinical information. The data then undergoes a preprocessing stage. This stage is multifaceted and includes the removal of any null values present in the data, the elimination of zero values from the ‘age_approx’ column, and the custom encoding of the ‘sex’ and ‘anatomical_site_general_challenge’ columns.
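A minimal pandas sketch of this tabular preprocessing step is shown below, assuming the metadata is available as a CSV file with the column names used in the SIIM-ISIC competition; the file path and the specific encoding maps are illustrative assumptions.

```python
import pandas as pd

# Hypothetical path to the SIIM-ISIC training metadata (adjust to the local copy).
df = pd.read_csv("train.csv")

# Remove rows with null values and zero entries in the approximate-age column.
df = df.dropna(subset=["sex", "age_approx", "anatomical_site_general_challenge"])
df = df[df["age_approx"] != 0]

# Custom integer encoding of the categorical columns.
df["sex"] = df["sex"].map({"male": 0, "female": 1})
site_codes = {site: i for i, site in
              enumerate(sorted(df["anatomical_site_general_challenge"].unique()))}
df["anatomical_site_general_challenge"] = df["anatomical_site_general_challenge"].map(site_codes)
```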

Densenet161 loss and accuracy curves
Additionally, data selection is performed based on the target value, selecting all data with a malignant target value and sampling an equal amount of data with a benign target value. The images are then processed, resized to a standard size of (224, 224, 3), augmented to increase the diversity of the dataset, normalized, and converted to tensors with a float data type.
Concurrently, ‘sex’, ‘age_approx’, and ‘anatomical_site_general_challenge’ are extracted as independent variables from the tabular data. The image tensors and tabular tensors are then utilized to train the model. The image tensors are processed through a MobileNet V3 Large model for feature extraction, while the tabular tensors are processed through a multilayer perceptron. The training process involves iteratively optimizing the model parameters using the Adam optimizer with a learning rate of 0.01.

Convnext_base loss and accuracy curves

Mobilenet_v3_large loss and accuracy curves
The loss function employed was the Binary Cross-Entropy with Logits Loss (BCEWithLogitsLoss), which is well-suited for binary classification tasks such as melanoma classification.
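Putting these pieces together, the sketch below shows one plausible training loop using the Adam optimizer with a learning rate of 0.01 and BCEWithLogitsLoss over 35 epochs, as reported above. The `MultiInputSkinLesionModel` class is the hypothetical model sketched earlier, and the dummy tensors stand in for the preprocessed SIIM-ISIC image and tabular batches.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Dummy stand-ins for the real data; in practice these come from the
# preprocessed SIIM-ISIC images and tabular features described above.
images = torch.randn(32, 3, 224, 224)
tabular = torch.randn(32, 3)                         # sex, age_approx, anatomical site
labels = torch.randint(0, 2, (32,)).float()          # 0 = benign, 1 = malignant
train_loader = DataLoader(TensorDataset(images, tabular, labels), batch_size=8, shuffle=True)

model = MultiInputSkinLesionModel()                  # hypothetical class from the earlier sketch
criterion = nn.BCEWithLogitsLoss()                   # binary cross-entropy on raw logits
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

model.train()
for epoch in range(35):                              # the study reports 35 training epochs
    for img_batch, tab_batch, label_batch in train_loader:
        optimizer.zero_grad()
        logits = model(img_batch, tab_batch).squeeze(1)
        loss = criterion(logits, label_batch)
        loss.backward()
        optimizer.step()
```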
Finally, the model is evaluated using a separate test set, outputting a prediction of either ‘Benign’ or ‘Malignant’ for each skin lesion. To evaluate the performance of the trained model, a comprehensive suite of evaluation metrics was employed. The primary metrics used were the confusion matrix and the classification report. The confusion matrix provides a detailed breakdown of the model’s predictions, including true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). From the confusion matrix, metrics such as accuracy, precision, recall (sensitivity), and F1-score can be derived.
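A minimal sketch of how these metrics can be derived with scikit-learn is given below; the label and prediction vectors are illustrative placeholders for the model's outputs on the held-out test set.

```python
from sklearn.metrics import classification_report, confusion_matrix

# y_true / y_pred: 0 = benign, 1 = malignant, collected over the held-out test set.
y_true = [0, 0, 1, 1, 1, 0]       # illustrative labels
y_pred = [0, 0, 1, 0, 1, 0]       # illustrative predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)           # sensitivity
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
print(classification_report(y_true, y_pred, target_names=["Benign", "Malignant"]))
```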
The results of this study highlight MobileNet V3 Large as the optimal pretrained model for skin lesion classification, offering superior accuracy and robustness. These findings underscore the potential of deep learning models in enhancing diagnostic accuracy and guiding clinical decision-making in melanoma detection. The results obtained from the evaluation on the test set are summarized in Table 3, while Table 4 provides the corresponding confusion matrix.

VGG16 loss and accuracy curves

Efficientnet_v2_s loss and accuracy curves
Performance on the test set
| Model | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%) |
|---|---|---|---|---|
| Efficientnet_v2_s | 91.74 | 96.33 (B) / 87.60 (M) | 87.50 (B) / 96.36 (M) | 91.70 (B) / 91.77 (M) |
| Convnext_base | 77.39 | 82.07 (B) / 73.38 (M) | 72.50 (B) / 82.72 (M) | 76.99 (B) / 77.77 (M) |
| Densenet161 | 98.69 | 97.56 (B) / 100.00 (M) | 100.00 (B) / 97.27 (M) | 98.76 (B) / 98.61 (M) |
| Mobilenet_v3_large | 99.56 | 100.00 (B) / 99.09 (M) | 99.16 (B) / 100.00 (M) | 99.58 (B) / 99.54 (M) |
| VGG16 | 87.39 | 85.83 (B) / 90.83 (M) | 90.83 (B) / 85.83 (M) | 88.26 (B) / 88.26 (M) |
Confusion matrix on the test set
| Model | TP (B) | TN (M) | FN | FP |
|---|---|---|---|---|
| Efficientnet_v2_s | 105 | 106 | 15 | 4 |
| Convnext_base | 87 | 91 | 33 | 19 |
| Densenet161 | 120 | 107 | 0 | 3 |
| Mobilenet_v3_large | 119 | 110 | 1 | 0 |
| VGG16 | 109 | 92 | 11 | 18 |
This study investigated the performance of different pretrained models as feature extractors for distinguishing benign and malignant skin lesions. Among the models examined, MobileNet V3 Large demonstrated the highest accuracy, achieving an outstanding 99.56%.
A detailed comparison of the models’ performance metrics, encompassing accuracy, precision, recall, and F1-score, is presented in Table 3. MobileNet V3 Large exhibited superior performance across all metrics, indicating its effectiveness in capturing discriminative features from skin lesion images. With a precision of 100% for benign lesions and a recall of 100% for malignant lesions, MobileNet V3 Large demonstrated exceptional accuracy in identifying both classes.
EfficientNet V2 Small and VGG16 also yielded competitive results, achieving accuracies of 91.74% and 87.39%, respectively, though with slightly lower precision and recall than MobileNet V3 Large. Conversely, ConvNeXt Base exhibited relatively lower performance, suggesting limitations in its feature extraction from skin lesion images.
The confusion matrices provided in Table 4 offer insights into the models’ classification performance, depicting TP, TN, FP, and FN. MobileNet V3 Large achieved the highest TP rate for both benign and malignant lesions, with only one FN and no FP instances. This underscores its robustness in accurate lesion classification, minimizing misclassifications, and mitigating the risk of false diagnoses.
In conclusion, the choice of a pretrained model as a feature extractor significantly impacts the performance of skin lesion classification tasks. MobileNet V3 Large emerged as the optimal model, offering superior accuracy and robustness in distinguishing between benign and malignant lesions. Future research could explore ensemble methods or fine-tuning strategies to further enhance model performance and generalization capabilities.
In conclusion, this study emphasizes the severity of the global skin cancer issue, particularly melanoma, due to its aggressive behavior and potential to spread. Early detection is pivotal in combatting this disease and saving lives. The introduction of a novel approach to melanoma classification utilizing the MobileNet V3 Large model represents a significant advancement in this field.
By integrating skin lesion images with pertinent patient data such as age, gender, and lesion location, this method demonstrates enhanced predictive capabilities for distinguishing between malignant and benign lesions.
The comprehensive analysis of a vast dataset confirms the superiority of this approach over existing methods, as evidenced by an impressive accuracy rate of 99.56%.
These findings hold promise for revolutionizing melanoma diagnosis and subsequently reducing mortality rates associated with this disease. By harnessing the power of multi-input methodology, clinicians can anticipate more accurate and timely diagnoses, thereby facilitating prompt intervention and improved patient outcomes. This research heralds a new era in melanoma detection, underscoring the potential of innovative technologies to mitigate the impact of this deadly disease on global health.
Future research efforts should prioritize the refinement and optimization of multi-input models, the integration of emerging imaging technologies, and the development of user-friendly AI-driven diagnostic tools. Additionally, extensive validation studies and clinical trials are imperative to confirm the real-world effectiveness of novel diagnostic approaches. Furthermore, there is a pressing need to explore personalized medicine strategies tailored to individual patient characteristics, while simultaneously addressing global disparities in access to advanced diagnostic and treatment modalities. By following these paths, the field can strive towards more accurate diagnosis, personalized treatment regimens, and equitable healthcare access, ultimately leading to improved outcomes and reduced mortality rates for individuals affected by melanoma worldwide.
