Table 1
Description of the DL-based CV algorithms tested in the present study, including the number of parameters (in millions) in each algorithm when trained from scratch, the number of convolutional layers (Nº Conv), whether Batch Normalisation (B.N.) layers are included, and whether pretrained weights are available in TensorFlow for the purpose of transfer learning. The DS columns indicate which of the three datasets in this study were originally used to train each algorithm: Abellá, Baquedano & Domínguez-Rodrigo (2022) (DS1), Domínguez-Rodrigo et al. (2020) (DS2), and Pizarro-Monzo et al. (2023) (DS3). The references column cites the original publication of each algorithm. * This number includes all of the convolutional layers within the inception modules (large blocks of multiple convolutional layers). ** A precise number cannot be reported as this architecture works differently from the others.
| ALGORITHM | PARAMS. | Nº CONV | B.N. | PRETRAINED | DS1 | DS2 | DS3 | REFERENCES |
|---|---|---|---|---|---|---|---|---|
| Jason1 | ≈ 3 Mil. | 8 | False | False | x | | | Brownlee, 2017 |
| Jason2 | ≈ 3 Mil. | 8 | True | False | x | x | | Brownlee, 2017 |
| VGG16 | ≈ 15 Mil. | 13 | False | True | x | | | Simonyan & Zisserman, 2015 |
| DenseNet201 | ≈ 18 Mil. | 201 | True | True | x | x | x | Huang et al., 2017 |
| VGG19 | ≈ 20 Mil. | 16 | False | True | x | x | | Simonyan & Zisserman, 2015 |
| InceptionV3 | ≈ 22 Mil. | 96 * | True | True | x | | | Szegedy et al., 2015 |
| ResNet50 | ≈ 24 Mil. | 49 | True | True | x | x | x | He et al., 2016 |
| AlexNet | ≈ 35 Mil. | 5 | True | False | x | | | Krizhevsky et al., 2012 |
| EfficientNetB7 | ≈ 64 Mil. | ** | True | True | x | | | Tan & Le, 2020 |
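The parameter counts in Table 1 are dominated by the convolutional layers: a standard 2D convolution with k×k kernels, c_in input channels and c_out filters contributes (k·k·c_in + 1)·c_out trainable weights. A minimal sketch of this accounting; the example layer shapes are the first four convolutional layers of a VGG-style network:

```python
def conv2d_params(kernel_size: int, c_in: int, c_out: int, bias: bool = True) -> int:
    """Number of trainable parameters in a standard 2D convolutional layer."""
    weights = kernel_size * kernel_size * c_in * c_out
    return weights + (c_out if bias else 0)

def batchnorm_params(channels: int) -> int:
    """Trainable parameters (gamma and beta) of a Batch Normalisation layer."""
    return 2 * channels

# Illustrative example: the first two convolutional blocks of a VGG-style network,
# operating on 3-channel RGB input with 3x3 kernels.
layers = [conv2d_params(3, 3, 64), conv2d_params(3, 64, 64),
          conv2d_params(3, 64, 128), conv2d_params(3, 128, 128)]
print(sum(layers))  # 260160 parameters in these four layers alone
```

Fully connected classification heads add the remainder, which is why architectures with large dense layers can exceed deeper but fully convolutional ones in total parameters.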

Figure 1
Examples of ideal train/validation learning curves. These curves were obtained from a neural network trained on a toy dataset, and are shown alongside example learning curves from underfitting and overfitting neural networks.
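The diagnoses illustrated in Figure 1 can be approximated programmatically: overfitting shows validation loss rebounding away from its minimum while the train/validation gap widens, and underfitting shows training loss that never comes down. A crude sketch; the window size and thresholds here are illustrative, not values used in the study:

```python
def diagnose_curves(train_loss, val_loss, window=3, gap_tol=0.1):
    """Crude learning-curve diagnosis; thresholds are illustrative only.

    Compares the mean validation loss over the last `window` epochs
    against the best validation loss seen during training.
    """
    recent_val = sum(val_loss[-window:]) / window
    best_val = min(val_loss)
    gap = val_loss[-1] - train_loss[-1]
    if recent_val > best_val + gap_tol and gap > gap_tol:
        return "overfitting"    # validation loss has rebounded away from its minimum
    if train_loss[-1] > 1.0:    # training loss itself never came down (toy threshold)
        return "underfitting"
    return "good fit"

# Toy curves: training loss keeps dropping while validation loss turns upward.
train = [2.0, 1.2, 0.8, 0.5, 0.3, 0.2, 0.1]
val   = [2.1, 1.3, 0.9, 0.7, 0.8, 1.0, 1.2]
print(diagnose_curves(train, val))  # overfitting
```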
Table 2
Descriptive statistics of the image-quality metrics extracted from images of the different classes present in DS1 and DS3; DS2 was excluded from the table as it is contained within DS3. Metrics include the Laplacian of Gaussian (LoG) variance, Fast Fourier Transform (FFT) magnitudes, the percentage of each image presenting detectable features using Canny Edge Detection (CED), the percentage of images presenting adequate levels of contrast, and the percentage of images presenting complications due to the presence of specularities (Spec.). Descriptive statistics report the central tendency followed by 95% confidence intervals constructed from distribution quantiles. CM = Cut Mark, Croc. = Crocodylian Tooth Score, TM = Carnivoran Tooth Mark, Tmp = Trampling.
| SAMPLE | LOG VARIANCE | FFT MAGNITUDE | CED (%) | CONTRAST (%) | SPEC. (%) |
|---|---|---|---|---|---|
| DS1-CM | 20.6 [13.8, 55.6] | 9.1 [0.6, 24.1] | 33.4 [16.3, 56.7] | 45.0 | 11.3 |
| DS1-Croc. | 71.0 [17.8, 230.8] | 19.0 [5.7, 30.7] | 49.9 [26.8, 65.7] | 95.7 | 58.7 |
| DS3-CM | 22.1 [14.1, 70.2] | 8.7 [0.7, 22.9] | 35.7 [16.6, 66.3] | 29.6 | 9.8 |
| DS3-TM | 41.9 [13.6, 133.3] | 16.2 [–0.2, 28.0] | 48.3 [14.6, 68.2] | 80.9 | 62.1 |
| DS3-Tmp | 44.1 [10.2, 378.4] | 9.7 [–6.7, 93.2] | 45.2 [2.7, 87.8] | 47.3 | 40.0 |
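The LoG variance in Table 2 is a standard focus measure: the variance of the image's response to a Laplacian kernel, where low values indicate blur and few detectable features. A minimal numpy sketch of this metric, together with a simple contrast check; the percentile-spread contrast criterion and its threshold are illustrative stand-ins, not necessarily those used in the study:

```python
import numpy as np

LAPLACIAN = np.array([[0, 1, 0],
                      [1, -4, 1],
                      [0, 1, 0]], dtype=float)

def laplacian_variance(img: np.ndarray) -> float:
    """Variance of the Laplacian response: a common sharpness/focus measure."""
    h, w = img.shape
    out = np.zeros((h - 2, w - 2))
    for dy in range(3):              # valid-mode 3x3 convolution via shifted slices
        for dx in range(3):
            out += LAPLACIAN[dy, dx] * img[dy:dy + h - 2, dx:dx + w - 2]
    return float(out.var())

def has_adequate_contrast(img: np.ndarray, min_spread: float = 50.0) -> bool:
    """Illustrative contrast check: intensity spread between 5th/95th percentiles."""
    lo, hi = np.percentile(img, [5, 95])
    return (hi - lo) >= min_spread

flat = np.full((64, 64), 128.0)           # featureless grey image
noisy = np.random.default_rng(0).integers(0, 256, (64, 64)).astype(float)
print(laplacian_variance(flat))           # 0.0 -- no detectable features
print(laplacian_variance(noisy) > 1000)   # True -- high-frequency content
```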

Figure 2
Examples of images displaying exceptionally poor quality. Examples of photographs of cut marks from DS1 and DS3 presenting especially poor image quality, with a considerable portion of pixels out of focus towards the image border. Sobel gradient maps in the right-hand panels highlight these features: sharp changes in gradient are clearly visible in the centre of each image, while the regions towards the edges present a high degree of out-of-focus blur with almost no detectable features.
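The gradient maps in Figure 2 come from the Sobel operator, which estimates horizontal and vertical intensity gradients and combines them into a magnitude map; blurred regions respond weakly. A numpy sketch of that computation:

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T

def _conv3(img, kernel):
    """Valid-mode 3x3 convolution implemented with shifted slices."""
    h, w = img.shape
    out = np.zeros((h - 2, w - 2))
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * img[dy:dy + h - 2, dx:dx + w - 2]
    return out

def sobel_magnitude(img: np.ndarray) -> np.ndarray:
    """Gradient magnitude map; sharp regions respond strongly, blur yields ~0."""
    gx, gy = _conv3(img, SOBEL_X), _conv3(img, SOBEL_Y)
    return np.hypot(gx, gy)

# A vertical step edge produces strong gradients at the edge and none elsewhere.
img = np.zeros((8, 8)); img[:, 4:] = 255.0
mag = sobel_magnitude(img)
print(mag[:, 0].max(), mag[:, 2].max())  # 0.0 in the flat region, 1020.0 at the edge
```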

Figure 3
Examples of photographs presenting specular reflections. Examples of photographs of tooth marks from DS1 and DS3 presenting areas of abnormally intense brightness in certain pixels as a product of specular reflections. The right-hand panels mark the pixels where these abnormalities were detected.
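Specular highlights of the kind shown in Figure 3 are commonly flagged as near-saturated pixels. A minimal sketch assuming a simple intensity threshold on an 8-bit greyscale image; the threshold value is illustrative, not necessarily the one used in the study:

```python
import numpy as np

def specular_mask(img: np.ndarray, threshold: int = 240) -> np.ndarray:
    """Boolean mask of near-saturated pixels, a simple proxy for specularities."""
    return img >= threshold

def specular_fraction(img: np.ndarray, threshold: int = 240) -> float:
    """Fraction of the image flagged as specular."""
    return float(specular_mask(img, threshold).mean())

img = np.full((100, 100), 120.0)    # mid-grey background
img[40:60, 40:60] = 255.0           # a saturated 20x20 reflection patch
print(specular_fraction(img))       # 0.04 -> 4% of pixels flagged
```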
Table 3
Evaluation metrics when predicting the different types of BSMs using the models trained on each dataset and evaluated on the test set. Note that a high accuracy value does not necessarily imply good performance, as evidenced by the (imbalanced) DS1 dataset.
| | DS1 | DS2 | DS3 |
|---|---|---|---|
| Precision | 0.46 | 0.77 | 0.88 |
| Recall | 0.50 | 0.69 | 0.87 |
| F1 | 0.48 | 0.66 | 0.88 |
| Accuracy | 0.92 | 0.86 | 0.91 |
Table 4
Error rates (in %) when predicting the different types of BSMs using the models trained on each dataset and evaluated on the test sets. Error rates are reported as the RMSE of the labels.
| | DS1 | DS2 | DS3 |
|---|---|---|---|
| Tooth Score | – | 15.34 | 7.95 |
| Trampling | – | 55.30 | 22.35 |
| Cut Mark | 5.29 | 10.14 | 7.90 |
| Crocodile | 24.41 | – | – |
| Overall | 8.63 | 15.13 | 9.77 |
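Table 4's caption describes error rates as the RMSE of the labels. Assuming this means the root-mean-square error between one-hot encoded true labels and the network's predicted class probabilities, expressed as a percentage, the computation might look like the following sketch; the toy labels and probabilities are purely illustrative:

```python
import numpy as np

def rmse_percent(y_true_onehot: np.ndarray, y_prob: np.ndarray) -> float:
    """RMSE between one-hot labels and predicted probabilities, as a percentage."""
    return float(np.sqrt(np.mean((y_true_onehot - y_prob) ** 2)) * 100)

# Toy example: three samples, two classes, reasonably confident predictions.
y_true = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
y_prob = np.array([[0.9, 0.1], [0.2, 0.8], [0.8, 0.2]])
print(round(rmse_percent(y_true, y_prob), 2))  # 17.32
```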
Table 5
Confusion matrix obtained when evaluating a Jason2 model trained on DS1 and evaluated on its test set. Note that the confusion matrix presents a true positive rate of 0 for the crocodile class; the algorithm classifies all samples as cut marks regardless of their true class.
| TRUE CLASS | PREDICTED: CROCODILE | PREDICTED: CUT MARK |
|---|---|---|
| Crocodile | 0 | 13 |
| Cut Mark | 0 | 146 |
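The degenerate behaviour in Table 5 explains the DS1 column of Table 3: with every sample predicted as a cut mark, accuracy remains 146/159 ≈ 0.92 while the macro-averaged metrics collapse. The sketch below recomputes those metrics from the confusion matrix; the undefined precision of the never-predicted class is treated as 0 here:

```python
def macro_metrics(cm):
    """Macro-averaged precision/recall/F1 and accuracy from a confusion matrix.

    cm[i][j] = number of samples of true class i predicted as class j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    precisions, recalls, f1s = [], [], []
    for k in range(n):
        tp = cm[k][k]
        predicted = sum(cm[i][k] for i in range(n))   # column sum
        actual = sum(cm[k])                           # row sum
        p = tp / predicted if predicted else 0.0      # 0 when class never predicted
        r = tp / actual if actual else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(2 * p * r / (p + r) if p + r else 0.0)
    accuracy = sum(cm[k][k] for k in range(n)) / total
    avg = lambda xs: sum(xs) / len(xs)
    return avg(precisions), avg(recalls), avg(f1s), accuracy

# Table 5: rows = true (Crocodile, Cut Mark), columns = predicted, same order.
p, r, f1, acc = macro_metrics([[0, 13], [0, 146]])
print(round(p, 2), round(r, 2), round(f1, 2), round(acc, 2))  # 0.46 0.5 0.48 0.92
```

These values reproduce the DS1 column of Table 3 exactly, confirming that its apparently high accuracy is an artefact of class imbalance.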

Figure 4
Empirical learning curves for neural networks. Learning curves obtained from the best-performing convolutional neural network architecture on each dataset: Jason2 for DS1, VGG16 for DS2, and DenseNet201 for DS3.
Table 6
Confusion matrix obtained when evaluating VGG16 on the test set of DS2 and DenseNet201 on the test set of DS3.
| TRUE CLASS | DS2: CUT MARK | DS2: SCORE | DS2: TRAMPLING | DS3: CUT MARK | DS3: SCORE | DS3: TRAMPLING |
|---|---|---|---|---|---|---|
| Cut Mark | 134 | 11 | 1 | 163 | 2 | 6 |
| Score | 2 | 28 | 0 | 7 | 126 | 3 |
| Trampling | 1 | 13 | 4 | 1 | 11 | 33 |
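The matrices in Table 6 are consistent with the accuracy row of Table 3: summing the diagonal (correct predictions) over the total gives 166/194 ≈ 0.86 for DS2 and 322/352 ≈ 0.91 for DS3. A quick check:

```python
def accuracy_from_cm(cm):
    """Overall accuracy: diagonal (correct predictions) over all samples."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total

# Rows = true (Cut Mark, Score, Trampling); columns = predicted, same order.
ds2 = [[134, 11, 1], [2, 28, 0], [1, 13, 4]]
ds3 = [[163, 2, 6], [7, 126, 3], [1, 11, 33]]
print(round(accuracy_from_cm(ds2), 2), round(accuracy_from_cm(ds3), 2))  # 0.86 0.91
```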

Figure 5
Grad-CAM results displaying suboptimal detection of features. Grad-CAM results for a selection of images displaying particularly poor identification of relevant features for BSM classification. Lighter shades of yellow highlight areas where the CNN identifies notable features, while darker areas leaning towards blue indicate regions that are not of interest to the CNN when identifying each type of BSM.
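The heat maps in Figure 5 are produced by Grad-CAM, which weights each feature map of the last convolutional layer by its spatially pooled gradient and passes the weighted sum through a ReLU. A numpy sketch of just that combination step, assuming the feature maps and the gradients of the class score with respect to them have already been extracted from the network (the toy arrays below are purely illustrative):

```python
import numpy as np

def grad_cam(fmaps: np.ndarray, grads: np.ndarray) -> np.ndarray:
    """Core Grad-CAM combination step.

    fmaps: (H, W, C) activations of the last conv layer for one image.
    grads: (H, W, C) gradients of the target class score w.r.t. those activations.
    Returns an (H, W) heat map normalised to [0, 1].
    """
    weights = grads.mean(axis=(0, 1))                       # global-average-pool the gradients
    cam = np.maximum((fmaps * weights).sum(axis=-1), 0.0)   # weighted sum + ReLU
    return cam / cam.max() if cam.max() > 0 else cam

# Toy example: channel 0 fires on the mark region, channel 1 on the background;
# only channel 0 has positive gradient, so the map highlights the mark region.
fmaps = np.zeros((4, 4, 2)); fmaps[1:3, 1:3, 0] = 1.0; fmaps[:, :, 1] = 0.5
grads = np.zeros((4, 4, 2)); grads[:, :, 0] = 1.0; grads[:, :, 1] = -1.0
heat = grad_cam(fmaps, grads)
print(heat[1, 1], heat[0, 0])  # 1.0 at the mark, 0.0 on the background
```

In practice the resulting map is upsampled to the input resolution and overlaid on the photograph, which is what the yellow-to-blue colouring in Figure 5 represents.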
