
Deep Learning-Based Computer Vision Is Not Yet the Answer to Taphonomic Equifinality in Bone Surface Modifications


1. Introduction

The concept of equifinality recognises that a number of different processes or agents can lead to the same outcome or result, highlighting the need for a holistic and nuanced understanding of certain complex, open systems in order to avoid reaching erroneous conclusions. With time, equifinality has become a fundamental component of taphonomic research, highlighting how different agents can produce some very similar modifications to the surface of bones. This is considerably important when using taphonomic data to answer pivotal questions about human evolution, such as the role that meat eating had in the diet of early hominins (e.g., Shipman & Walker, 1989; Aiello & Wheeler, 1995; Domínguez-Rodrigo, Barba & Egeland, 2007; Domínguez-Rodrigo et al., 2021a; Pobiner, 2020; Barr et al., 2022), or the hunting/scavenging capacities these ancient populations may have had (e.g., Binford, 1981; Blumenschine, 1995; Bunn, 1991; Selvaggio, 1998; Pickering, 2013; Domínguez-Rodrigo, 2015; Pante et al., 2015; Domínguez-Rodrigo, Barba & Egeland, 2007; Domínguez-Rodrigo et al., 2021a; Pobiner, 2020).

Historically, particular interest has been paid to the differentiation of different bone surface modifications (BSMs), including discerning between cut marks and naturally produced trampling marks (Oliver, 1984; Andrews & Cook 1985; Shipman & Rose, 1983; Shipman, 1988; Domínguez-Rodrigo et al., 2009), as well as fine linear traces that can also be produced by other biotic agents, such as crocodylians (e.g., Njau & Blumenschine, 2006; Drumheller & Brochu, 2014; Baquedano, Domínguez-Rodrigo & Musiba, 2012; Njau & Gilbert, 2016; Sahle, Zaatari & White, 2017) or vultures (Reeves, 2009; Domínguez-Solera & Domínguez-Rodrigo, 2011). While these studies prominently improved our understanding of the variability and nature of these BSMs, most diagnostic variables are based on qualitative and visual criteria, which are often biased by subjective factors such as analyst experience, perspective, or observation conditions (Domínguez-Rodrigo et al., 2017). Likewise, overlaying biostratinomic, diagenetic, and trephic processes are also known to considerably alter the appearance of BSMs (Pineda et al., 2014; 2019; 2023; Gümrükçu & Pante, 2018; Valtierra, Courtenay & López-Polín, 2020), thus increasing confounding factors leading to perceived equifinality.

One point upon which all of the aforementioned studies agree is the necessity for more objective and scientifically robust strategies to overcome the challenges posed by equifinality in taphonomic research. Recently developed methods using Artificial Intelligence (AI) are at the forefront of this line of investigation and are growing in popularity due to the more accessible nature of machine and deep learning tools. Here we focus on the use of deep learning (DL)-based computer vision (CV) for the classification of images of different BSMs obtained using optical microscopes.

CV is a field dedicated to enabling computers and machines to interpret and understand visual stimuli from the world around them. Using these tools, CV research develops algorithms and techniques that allow computers to extract meaningful insights and information from different media such as images (Schenk, 1999; González-Aguilera, 2005). This task, however, is far from simple, and often requires complex mathematical and computational models that are capable of extracting features from pixel data (Hartley & Zisserman, 2003; Zitová & Flusser, 2003; Lowe, 2004; Kaehler & Bradski, 2017; González-Aguilera et al., 2018; 2020; Ruiz de Oña et al., 2023). DL, on the other hand, is generally considered a branch of computational science that is dedicated to the development of algorithms, primarily artificial neural networks, with multiple (deep) layers that can automatically extract (learn) intricate patterns and representations from data (LeCun, Bengio & Hinton, 2015; Goodfellow, Bengio & Courville, 2016). For image processing, these mechanisms often manifest as convolutional neural networks (CNNs), i.e. algorithms that extract hierarchical features by moving (convolving) different filters across the image. Through its adeptness at unravelling structured data, DL has played a central role in revolutionising tasks such as the classification and recognition of different elements in images (LeCun et al., 1998; Deng et al., 2009; 2010; Krizhevsky & Hinton, 2009; Krizhevsky, 2010; Krizhevsky, Sutskever & Hinton, 2012; Szegedy et al., 2016; Redmon et al., 2016).
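To make the convolutional mechanism more concrete, the following minimal sketch defines a small CNN in TensorFlow/Keras (the framework used later in this study). The layer sizes, input dimensions, and binary output are illustrative assumptions rather than the architecture of any model discussed below.

```python
# Minimal illustrative CNN in TensorFlow/Keras. The layer sizes, the
# 80 x 400 x 3 input shape, and the binary output are arbitrary choices
# for demonstration only.
import tensorflow as tf

model = tf.keras.Sequential([
    # Convolutional layers slide (convolve) small filters across the image,
    # each filter learning to respond to a different low-level feature.
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu", input_shape=(80, 400, 3)),
    tf.keras.layers.MaxPooling2D((2, 2)),
    # Deeper layers combine low-level responses into higher-level patterns.
    tf.keras.layers.Conv2D(64, (3, 3), activation="relu"),
    tf.keras.layers.MaxPooling2D((2, 2)),
    # Fully connected layers map the extracted features to a class probability.
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.summary()  # prints layer-by-layer output shapes and parameter counts
```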

Since its introduction as a means to overcome problems related with equifinality between different BSMs (Byeon et al., 2019), DL-based CV has been experimented with on a number of occasions (Cifuentes-Alcobendas & Domínguez-Rodrigo, 2019; 2021; Jiménez-García et al., 2020a; Abellán et al., 2021; Abellán, Baquedano & Domínguez-Rodrigo, 2022; Domínguez-Rodrigo et al., 2024), with some applications to actual archaeological remains (Abellán et al., 2021; Domínguez-Rodrigo et al., 2021b, c; Cobo-Sánchez et al., 2022; Pizarro-Monzo et al., 2022; 2023; Vegara-Riquelme et al., 2023; Moclán et al., 2024). In each of these cases, DL-based CV was presented as a promising tool, yet little attention was paid to the fact that DL is still a field under development and is not devoid of limitations (see Discussion).

Here, some empirically based concerns are presented on the fragile nature of these algorithms for the identification of BSMs. A discussion follows on some of the most notable limitations faced by DL-based CV. Using previously published image datasets, this study shows that: (1) the quality of available datasets is inadequate for in-depth analyses, and (2) high accuracy values are misleading. Our reassessment questions the current reliability of these approaches for discerning BSMs. This study therefore serves as a cautionary note, alongside the recent work of Calder et al. (2022), McPherron et al. (2022), and Yezzi-Woodley et al. (2024), to argue that considerable developments are still required before highly complex computational techniques can be applied on a wide scale to the study of the fossil record.

2. Material and Methods

2.1. Datasets

The present study aims to assess the reliability of DL-based CV for the classification of experimental taphonomic BSMs. For this purpose, this study has taken previously published datasets, analysed the quality of their images, and replicated the analyses performed. The datasets consist of: (1) that published by Abellán, Baquedano and Domínguez-Rodrigo (2022), available from https://doi.org/10.7910/DVN/9NOD8W [Last accessed 07/08/2023], here referred to as DS1; (2) the dataset published by Domínguez-Rodrigo et al. (2020), available from https://doi.org/10.7910/DVN/62BRBP [Last accessed 07/08/2023], here referred to as DS2; and (3) the dataset published by Pizarro-Monzo et al. (2023), available from https://doi.org/10.17632/3bm34bp6p4.1 [Last accessed 07/08/2023], here referred to as DS3.

DS1 consists of 488 photographs of cut marks and 45 photographs of crocodylian tooth scores. DS2 consists of 489 cut marks, 103 carnivoran tooth scores, and 63 trampling marks. Finally, DS3 consists of the same images from DS2 with an additional 51 cut marks, 629 carnivoran tooth scores, and 154 trampling marks. According to their descriptions, all three datasets were obtained using a binocular stereomicroscope at × 30 magnification using the same lighting intensity and angle, while images were cropped so as to ensure that only the mark and its shoulders are visible in the image. As described by the original authors, pre-processing had already been performed on these images to ensure that they are all “black and white”, yet the original authors do not indicate how this pre-processing was performed. The workflow for processing DS1 and DS2 then homogenises images to a size of 80 × 400 pixels, while that of DS3 transforms images to a size of 250 × 200 pixels.

It is important to point out here a number of terminological clarifications that may have an impact on the quality of results, primarily regarding the use of the term black-and-white. Technically, and from the perspective of computer science, the images are in fact greyscale 8-bit images with 3 channels, where R=G=B, and not single-channel 8-bit greyscale images, or binary black-and-white images. This is important because this specific greyscale conversion has not reduced the number of channels included within each image: each pixel still contains three values, i.e. three channels. From this perspective, no change is made to the size of the input provided to the algorithms, as would occur if images were converted to single-channel greyscale images. The only transformation applied to the images’ size is to their height and width; the number of numeric values per pixel remains the same (see Discussion for further comments on this point). The reason this conversion is carried out, however, is to facilitate transfer learning, which would not be possible if the images used for taphonomic studies were single-channel greyscale images.
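As an illustration of this point, the following sketch (using OpenCV and NumPy, with a hypothetical file name) shows how the three-channel, R=G=B nature of such images can be verified, and how a genuine single-channel conversion differs.

```python
# Sketch verifying the "black and white" property described above: the images
# are 8-bit arrays with three identical channels (R = G = B), not single-channel
# greyscale or binary black-and-white images. The file name is hypothetical.
import cv2
import numpy as np

img = cv2.imread("example_cut_mark.jpg")             # loaded as (height, width, 3) uint8
b, g, r = cv2.split(img)
print(img.shape)                                      # three channels are still present
print(np.array_equal(r, g) and np.array_equal(g, b))  # True for greyscale stored as RGB

# A genuine single-channel conversion, by contrast, drops two thirds of the values:
single_channel = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
print(single_channel.shape)                           # (height, width) only
```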

Datasets from other research articles were not considered as they were either unavailable or not relevant to the present study. Among the latter we excluded datasets aimed at the identification of specific carnivores producing tooth marks (Abellán et al., 2021; Domínguez-Rodrigo et al., 2024), or discerning between raw materials producing cut marks (Cifuentes-Alcobendas & Domínguez-Rodrigo, 2021). The rationale behind this exclusion is that if DL-based CV is shown to be incapable of discerning among general categories of BSMs, such as cut marks and trampling marks, then this would automatically cast similar doubt on the model’s ability to differentiate subtler subcategories of marks produced by overall similar taphonomic agents.

2.2. Computer Vision and Image Analysis

To analyse the quality and suitability of these datasets for the purpose of more advanced, DL-based analyses, we performed a detailed exploratory analysis of the properties (blurriness, contrast, specularity) of each image. This constitutes a fundamental prerequisite in the curation of any dataset, especially when considering complex sources of information such as photographs.

The first step considered whether images were found to be in-focus or not. This preliminary assessment is crucial, considering fundamental variables in microscopy, such as focal distance, depth of field, and field of view. This phase of the analysis aimed to evaluate the clarity and sharpness of images by examining the gradient of each pixel. A pixel’s gradient represents information regarding both the rate and direction of change in intensity at this location, indicating the position and orientation of this change. Positive gradients indicate that a pixel is brighter than its neighbours, while negative values indicate a darker pixel. Ideally, the edges of features are sharply defined, leading to noticeable changes in a pixel’s absolute gradient. Conversely, images exhibiting out-of-focus blur present a gradual gradient transition from one pixel to the next (Ali & Mahmood, 2018).

Several focus measure operators exist for the analysis of images (Pertuz, Puig & García, 2013; Kaehler & Bradski, 2017), each with different levels of robustness and appropriateness for different types of images. The present study has considered four main methods for assessing image quality:

  1. Laplacian of Gaussian (LoG) filters, followed by the calculation of LoG gradient variance to find areas of rapid change (edges) in images. This approach closely follows the recommendations of Pech-Pacheco et al. (2000), with adaptations using a Gaussian pre-filtering to eliminate the impact of potential noise.

  2. Fast Fourier Transform (FFT) to calculate the mean magnitude of changes across the spectrum of frequencies within the image.

  3. Canny Edge Detector (CED) to identify the number of detectable features in the image (Canny, 1986). Edges are first calculated using the CED and then dilated using morphological operators to remove as many dead pixels as possible. We then calculated the percentage of pixels in each image presenting detectable features, with the understanding that areas that are not in-focus will display a lower frequency of clearly identifiable edges.

  4. Calculation of Sobel Gradient Maps (SGMs) to visually assess which areas of the image are in-focus or not.

Defining an optimal threshold to determine whether an image is in-focus or not can sometimes be problematic for each of these methods (Pech-Pacheco et al., 2000; Pertuz, Puig & García, 2013), with most case studies requiring a threshold that is particular to their type of image. For LoG, it is understood that images that are in-focus will present a higher LoG variance than images of poor quality. The same can be said of FFT, while obtaining images with a high number of detectable features through CED is also optimal. Considering these conditions, preliminary experiments presented in the Supplementary Materials located at https://github.com/LACourtenay/Reevaluating_Computer_Vision_Taphonomic_Equifinality, and previous publications of these metrics (Pech-Pacheco et al., 2000; Pertuz, Puig & García, 2013), we consider an image to be in-focus if it presents a LoG variance above 100, an FFT magnitude above 10, and a percentage of features detected through CED above 50%. Likewise, visual inspection of SGMs requires that features are sharp across the entire map.
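The following sketch outlines how these three quantitative focus measures and thresholds can be combined. The kernel sizes, Canny thresholds, frequency-domain details, and file name are illustrative assumptions rather than the exact implementation used here (which is available in the repository above).

```python
# Sketch of the three quantitative focus measures, assuming a single-channel
# 8-bit image as input. Parameter choices are illustrative only.
import cv2
import numpy as np

def log_variance(gray):
    # Gaussian pre-filtering to suppress noise, then variance of the Laplacian.
    smoothed = cv2.GaussianBlur(gray, (3, 3), 0)
    return cv2.Laplacian(smoothed, cv2.CV_64F).var()

def fft_mean_magnitude(gray, size=30):
    # Suppress the low-frequency centre of the spectrum and measure the mean
    # magnitude of what remains (high frequencies = fine, in-focus detail).
    h, w = gray.shape
    cy, cx = h // 2, w // 2
    spectrum = np.fft.fftshift(np.fft.fft2(gray))
    spectrum[max(cy - size, 0):cy + size, max(cx - size, 0):cx + size] = 0
    recon = np.fft.ifft2(np.fft.ifftshift(spectrum))
    return float(np.mean(20 * np.log(np.abs(recon) + 1e-8)))

def canny_feature_percentage(gray):
    # Percentage of pixels belonging to (dilated) Canny edges.
    edges = cv2.Canny(gray, 100, 200)
    edges = cv2.dilate(edges, np.ones((3, 3), np.uint8), iterations=1)
    return 100 * np.count_nonzero(edges) / edges.size

gray = cv2.imread("example_mark.jpg", cv2.IMREAD_GRAYSCALE)  # hypothetical file
in_focus = (log_variance(gray) > 100
            and fft_mean_magnitude(gray) > 10
            and canny_feature_percentage(gray) > 50)
```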

Next, images were analysed to assess whether they present optimal levels of contrast. Contrast is a direct means of assessing whether images were taken in optimal and controlled lighting conditions. An image is considered to present insufficient contrast if the differences between light and dark regions are harder to detect, thus hindering the ability to identify features such as edges and corners (Kaehler & Bradski, 2017). We can determine whether an image is of acceptable contrast by considering the ratio between the range spanned by the 1st and 99th percentiles of pixel intensities and the full range of possible values. If this ratio falls below a threshold, here set at 0.35 (sensu Rosebrock, 2021), then the image is considered of inadequate contrast.
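A minimal sketch of this contrast check, assuming an 8-bit greyscale input, is given below; scikit-image’s exposure.is_low_contrast function performs an equivalent test.

```python
# Minimal sketch of the contrast check: the spread between the 1st and 99th
# intensity percentiles is compared against the full 0-255 range of an
# 8-bit image, using the 0.35 threshold given above.
import numpy as np

def has_adequate_contrast(gray, threshold=0.35):
    p1, p99 = np.percentile(gray, (1, 99))
    return (p99 - p1) / 255.0 >= threshold
```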

Finally, we considered the appearance of specular reflections. Specular reflections occur when light hitting the surface of an object is reflected at an angle equal to the incident angle. This results in intense or even saturated brightness in certain pixels. In some cases, specular reflections may hinder processes such as feature extraction, and thus require a certain level of control over lighting issues (Feris et al., 2006; Morgand & Tamaazousti, 2014; Guo et al., 2016). For this analysis, we assessed the variability of the Value (V) channel from Hue-Saturation-Value (HSV) colour-space, considering pixels with V > 2B, where B is the average intensity of pixel values within the image, to be an indication of specularities within the image (Morgand & Tamaazousti, 2014). Here, we analysed whether specularities were present, as well as the percentage of pixels detected as being specularities. To account for possible noise, an erosion morphological operator was used to clean specular maps.
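The specularity check can be sketched as follows, assuming an 8-bit BGR image as loaded by OpenCV; the interpretation of B as the mean of the Value channel and the erosion kernel size are illustrative assumptions.

```python
# Sketch of the specularity check described above: pixels in the HSV Value
# channel exceeding twice the image's mean intensity are flagged, and the
# mask is eroded to suppress isolated noisy pixels.
import cv2
import numpy as np

def specular_percentage(img_bgr):
    value = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2HSV)[:, :, 2]
    mask = (value > 2 * value.mean()).astype(np.uint8) * 255
    mask = cv2.erode(mask, np.ones((3, 3), np.uint8), iterations=1)
    return 100 * np.count_nonzero(mask) / mask.size
```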

2.3. Deep Learning Algorithms

While none of the original studies utilising DL-based CV on these datasets provided accessible code for replicating analyses, valuable information is available in the supplementary materials provided by Domínguez-Rodrigo et al. (2020). These materials offer insights into the neural network architectures that can be employed to construct comparable DL algorithms. Despite the absence of code, the elucidation of these architectures in the supplementary materials serves as a useful resource for researchers seeking to implement and reproduce DL analyses in a similar context. Likewise, the supplementary materials provided in Cobo-Sánchez et al. (2022), available from https://doi.org/10.7910/DVN/BQTKBA [Last accessed 07/08/2023], are also useful for the replication of these types of analyses. For this purpose, attempts were made to replicate the architectures described in each of the original analyses for the classification of different images. Algorithms were then selected based on the best performance during evaluation, as per the descriptions provided by their original authors.

As the objectives of the present study are only to replicate, rather than present, the methods, no further description of the models will be provided. Table 1 nonetheless provides a general description of each algorithm; original publications by Domínguez-Rodrigo et al. (2020), Abellán, Baquedano & Domínguez-Rodrigo (2022), and Pizarro-Monzo et al. (2023) can be consulted for further interest. It is important to note that no alterations to algorithm architecture nor the way they were trained have been made.

Table 1

Description of the DL-based CV algorithms experimented with in the present study. This includes a summary of the number of parameters (millions) in each algorithm if trained from scratch, the number of convolutional layers (Nº Conv), whether Batch Normalisation (B.N.) layers are included in the algorithm, and whether pretrained weights are available in TensorFlow for the purpose of transfer learning. The DS column indicates which of the three datasets in this study were originally used to train each algorithm by Abellán, Baquedano & Domínguez-Rodrigo (2022) (DS1), Domínguez-Rodrigo et al. (2020) (DS2), and Pizarro-Monzo et al. (2023) (DS3). The References column refers to the original publication of each algorithm. * This number includes all of the convolutional layers included within the inception modules – large blocks of multiple convolutional layers. ** A precise number cannot be reported as the architecture works differently from the others.

ALGORITHM | PARAMS. | Nº CONV | B.N. | PRETRAINED | DS | REFERENCES
Jason1 | ≈ 3 Mil. | 8 | False | False | x | Brownlee, 2017
Jason2 | ≈ 3 Mil. | 8 | True | False | x x | Brownlee, 2017
VGG16 | ≈ 15 Mil. | 13 | False | True | x | Simonyan & Zisserman, 2015
DenseNet201 | ≈ 18 Mil. | 201 | True | True | x x x | Huang et al., 2017
VGG19 | ≈ 20 Mil. | 16 | False | True | x x | Simonyan & Zisserman, 2015
InceptionV3 | ≈ 22 Mil. | 96 * | True | True | x | Szegedy et al., 2015
ResNet50 | ≈ 24 Mil. | 49 | True | True | x x x | He et al., 2016
Alexnet | ≈ 35 Mil. | 5 | True | False | x | Krizhevsky et al., 2012
EfficientNetB7 | ≈ 64 Mil. | ** | True | True | x | Tan and Le, 2020

Further processing of images prior to the training of algorithms consisted of basic data augmentation typical of that applied in DL-based CV applications. This was performed using the ImageDataGenerator class provided by TensorFlow through the Keras API. Images were subjected to a series of transformations when loaded during training, including: (1) the rotation of images within a range of 40°; (2) both vertical and horizontal shifts amounting to a maximum of 20% of the image dimension; (3) shear transformations with a 20° intensity; (4) zooming both in and out of the image with a range of 20%; and (5) horizontal and vertical flips to mirror the image. For more efficient training, pixel values were also normalised to fall within a range of 0 and 1.
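A sketch of these augmentation settings, expressed through TensorFlow’s ImageDataGenerator, is given below; the argument values mirror the description above, though the original code may differ in detail.

```python
# Sketch of the augmentation settings described above, using TensorFlow's
# ImageDataGenerator; argument values follow the description in the text.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rescale=1.0 / 255,       # normalise pixel values to the range [0, 1]
    rotation_range=40,       # random rotations within 40 degrees
    width_shift_range=0.2,   # horizontal shifts up to 20% of image width
    height_shift_range=0.2,  # vertical shifts up to 20% of image height
    shear_range=20,          # shear transformations with 20 degree intensity
    zoom_range=0.2,          # zoom in and out by up to 20%
    horizontal_flip=True,    # mirror images horizontally
    vertical_flip=True,      # mirror images vertically
)
```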

2.4. Algorithm training and evaluation

As described in the original DL analyses, algorithms were trained on 70% of the data, with the remaining 30% separated for testing and evaluation. Here lies the first discrepancy with the original publications of each of these datasets. Best practices in DL dictate the use of three distinct subsets of the original dataset throughout the training and evaluation process of a DL model (Bishop 1991; 1995: pg. 372, 2006: pg. 32; Chollet, 2017 pg. 97–100; Goodfellow, Bengio & Courville, 2016: pg. 120). These three subsets are referred to as the training, the validation, and the test sets. As implied by its name, the training set is used to teach the learning model. A validation set is then used to evaluate the model’s performance during training, guiding hyperparameter optimisation and tuning without ever involving the test set. Insights gained from this evaluation are then used to adjust the model’s configuration and training regime. The test set remains entirely separate from this process and serves only as a means of evaluating model performance on unseen data.
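As an illustration of this principle, the following sketch shows a leakage-free three-way partition using scikit-learn’s train_test_split on hypothetical images and labels arrays; the proportions, stratification, and random seed are illustrative choices.

```python
# Sketch of a leakage-free three-way partition, with hypothetical `images`
# and `labels` arrays. The test set is held out entirely from training and tuning.
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(
    images, labels, test_size=0.30, stratify=labels, random_state=42)
x_train, x_val, y_train, y_val = train_test_split(
    x_train, y_train, test_size=0.20, stratify=y_train, random_state=42)
# x_train is used to fit weights, (x_val, y_val) to monitor training and tune
# hyperparameters, and (x_test, y_test) only for the final evaluation.
```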

Among the few examples of published code presented by authors in this field, Cifuentes-Alcobendas & Domínguez-Rodrigo (2019) only provided enough code up until the training and validation steps of the experiment, with no indication as to how the testing evaluation was performed. The code provided by Cobo-Sánchez et al. (2022), however, provides a little more information in the form of a Jupyter Notebook (Available from: https://doi.org/10.7910/DVN/BQTKBA [Accessed: 01/09/2023]). In this notebook, it can be seen in cells 23 and 27 of the code that the authors are using the same data for model validation (see code blocks 6, 9, and 20) and for testing. The same issue is also shown in Vegara-Riquelme et al. (2023)’s code (Available from: https://doi.org/10.7910/DVN/1OGN32 [Accessed: 01/09/2023]). In Domínguez-Rodrigo et al. (2024), the authors even directly insert the validation data into the evaluation portion of their code (Available from: https://doi.org/10.7910/DVN/MLDCIC [Accessed: 11/06/2024]). Detailed documentation of this is provided as Supplementary File 1.

These practices diverge from established practice in DL and lead to an inflation in reported accuracies for final models. As stated by Goodfellow, Bengio & Courville (2016: pg. 120): “It is important that the test examples are not used in any way to make choices about the model, including its hyperparameters. For this reason, no example from the test set can be used in the validation set”. Using the same data for both validation and testing phases introduces notable bias and compromises the model’s true generalisation performance. This is similar to the concerns raised by Calder et al. (2022) regarding training-testing contamination.

To address this concern, the present study introduces an additional split of the training data. This ensures that 20% of the training data is then designated for validation during the model’s training process. Training was then carried out for 100 epochs with mini-batch sizes of 32 images, using the Stochastic Gradient Descent algorithm (learning rate α = 0.001, momentum β = 0.9). Training was performed so as to reduce either the binary or categorical cross-entropy. When possible, transfer learning was used to improve the training process. Transfer learning involves taking models that have already been successfully trained on larger datasets, and fine-tuning the algorithms to the present datasets by only training the later layers within the model. For this purpose, and when available, the weights of algorithms trained on the ImageNet dataset were used (Deng et al., 2009). As per the workflow present in the previously published codes, all the convolutional layers from these algorithms were frozen.
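A sketch of this training configuration is given below, using a frozen VGG16 base with ImageNet weights as an example; the classification head, input dimensions, and the x_train/x_val arrays are illustrative assumptions rather than the exact setup of any of the original models.

```python
# Sketch of the training configuration described above: a frozen VGG16 base
# with ImageNet weights, SGD (learning rate 0.001, momentum 0.9), 100 epochs,
# and mini-batches of 32 images. Head layers and data arrays are hypothetical.
import tensorflow as tf

base = tf.keras.applications.VGG16(weights="imagenet", include_top=False,
                                   input_shape=(80, 400, 3))
base.trainable = False  # freeze all convolutional layers

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(3, activation="softmax"),  # e.g. cut / tooth score / trampling
])

model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.001, momentum=0.9),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)

history = model.fit(x_train, y_train, validation_data=(x_val, y_val),
                    epochs=100, batch_size=32)
```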

To evaluate the training process, loss curves were plotted and inspected, following typical practices in DL research (Glorot & Bengio, 2010; Goodfellow, Bengio & Courville, 2016). This involves evaluating the progress of a neural network over time by assessing the performance of an algorithm when classifying the train and validation sets. A well-trained neural network exhibits a consistent decrease in loss (or increase in accuracy) over time, eventually converging at a point where the algorithm cannot learn any more from the data. Evaluation of these curves, however, must fundamentally consider the loss and accuracy of the algorithm on both the training and the validation set. A powerful algorithm ideally shows a general trend where both the train and the validation curves increase or decrease at the same time, with an eventual convergence on the final value (Figure 1). If the two curves are shown to diverge, this is a strong indicator that the algorithm is not fitting correctly to the data (Figure 1). Overfitting is typically identified when the training loss continues to decrease while, beyond a certain point, the validation loss starts to increase. This implies that the algorithm is learning too much from the training data, and is unable to generalise to the validation data. Conversely, underfitting is diagnosed when the two curves do not converge and generally present a high separation between training and validation. Overfitting often occurs when algorithms are overly complex for the problem at hand, while underfitting is typically a product of training an algorithm on insufficient data. These issues, however, are frequently observed in combination, and training CNNs is therefore a delicate balance between avoiding overly complex algorithms and ensuring sufficient data.
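Such curves can be produced by plotting the training history returned by Keras, as sketched below (matplotlib is assumed to be available; history refers to the hypothetical object from the previous sketch).

```python
# Sketch of the learning-curve inspection: plot the training and validation
# loss stored in the History object returned by model.fit().
import matplotlib.pyplot as plt

plt.plot(history.history["loss"], label="training loss")
plt.plot(history.history["val_loss"], label="validation loss")
plt.xlabel("Epoch")
plt.ylabel("Cross-entropy loss")
plt.legend()
plt.show()  # diverging or non-converging curves indicate over- or underfitting
```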

Figure 1

Examples of ideal train/validation learning curves. These curves were obtained from a neural network trained on a toy dataset, alongside examples of both underfitting and overfitting neural networks.

Once trained, the models were then used to predict the labels of the test set. This is an additional, more practical means of assessing an algorithm’s ability to generalise to new data. For this purpose, predicted labels were compared with the original labels to calculate metrics such as accuracy, recall, precision, and a combination of precision and recall through the F1 score. For binary classification problems (e.g., DS1), precision-recall curves were also calculated alongside the relevant area under curve (AUC) metric. An additional metric considered is loss. In the three previously published studies, loss results seem to have been reported as the categorical or binary cross-entropy values computed by the model itself. Cross-entropy loss values, while highly relevant, are often harder to interpret than simple root-mean-squared-error (RMSE) values. For this purpose, while algorithms were trained to reduce cross-entropy loss, final evaluations of performance also considered RMSE, computed from the difference between the actual label of each individual and the probability of the image being associated with that label. For a binary classification problem, therefore, RMSE is calculated by simply taking the square root of the mean of (y – ŷ)², where y is the original label and ŷ is the output of the network. This is a direct and easy means of interpreting how confident an algorithm is when making new predictions.
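For a binary problem, this RMSE can be computed directly from the true labels and the predicted probabilities, as in the following sketch with hypothetical values.

```python
# Sketch of the RMSE loss for a binary problem: y holds the true labels (0 or 1)
# and y_hat the probabilities output by the network. Values are hypothetical.
import numpy as np

y = np.array([1, 0, 1])
y_hat = np.array([0.9, 0.2, 0.6])
rmse = np.sqrt(np.mean((y - y_hat) ** 2))
print(rmse)  # approximately 0.26, i.e. predictions deviate ~26% from the true labels
```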

Finally, in light of observations made by multiple authors (Cifuentes-Alcobendas & Domínguez-Rodrigo, 2019; Jiménez-García et al., 2020a; Abellán et al., 2021; Domínguez-Rodrigo et al., 2021c; Cobo-Sánchez et al., 2022), the gradient-weighted class activation mapping (Grad-CAM) technique was employed to produce activation maps for each image (Selvaraju et al., 2017; 2019), highlighting the areas that the model deems relevant for the association of each image to its label.
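A hedged sketch of the Grad-CAM computation for a trained Keras model is provided below; the model, the input image, and the name of the final convolutional layer are assumptions, and the implementations used in the original studies may differ.

```python
# Hedged sketch of Grad-CAM for a trained Keras CNN. `model`, the preprocessed
# `image` of shape (1, height, width, 3), and `last_conv_name` are assumptions.
import tensorflow as tf

def grad_cam(model, image, last_conv_name, class_index=None):
    # Map the input image to the final convolutional feature maps and the output.
    grad_model = tf.keras.Model(
        model.inputs, [model.get_layer(last_conv_name).output, model.output])
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image)
        if class_index is None:
            class_index = tf.argmax(preds[0])
        class_score = preds[:, class_index]
    # Gradient of the class score with respect to the feature maps, averaged
    # per channel to obtain importance weights.
    grads = tape.gradient(class_score, conv_out)
    weights = tf.reduce_mean(grads, axis=(0, 1, 2))
    # Weighted sum of the feature maps, rectified and normalised to [0, 1].
    cam = tf.nn.relu(tf.reduce_sum(conv_out[0] * weights, axis=-1))
    return (cam / (tf.reduce_max(cam) + 1e-8)).numpy()

# heatmap = grad_cam(model, image, "block5_conv3")  # hypothetical layer name
```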

2.5. Software

All analyses were performed in the Python v.3.10.9 programming language. Computer vision applications made use of the OpenCV v.4.7.0, Scikit-Image v.0.19.3, and Numpy v.1.22.0 libraries. DL applications were implemented using the TensorFlow v.2.12.0 framework.

All code in the present study can be found on the corresponding author’s GitHub page at: https://github.com/LACourtenay/Reevaluating_Computer_Vision_Taphonomic_Equifinality. Any additional code and figures are available from https://doi.org/10.6084/m9.figshare.24877743.v3.

3. Results

3.1. Dataset Quality

Analyses of images from each of the datasets reveal an alarmingly high percentage of images with insufficient quality for most types of CV-based tasks. A considerable number of images exhibit issues such as being out-of-focus, poor levels of contrast, or specular irregularities. These challenges make it difficult to identify features accurately, while casting doubt on the general quality of such datasets as well as their suitability for DL applications.

For DS1, a median LoG variance value of 21.6 was calculated, with 95% confidence intervals of [13.9, 98.3] constructed using distribution quantiles, clearly falling below the threshold of 100 required for images to be considered in-focus. This is additionally corroborated by FFT magnitude values computed at 9.7 +/– [0.7, 25.1], and by CED results, which were able to extract identifiable features from approximately 34.8% of pixels within each image. In general, 49.4% of images were calculated to have sufficient contrast, while 46.5% of images were found to present abnormal specular highlights. This latter percentage can be corrected to 15.3% when removing possible noise from images.

For DS2, LoG values were computed to be slightly higher at 23.5 +/– [14.1, 128.3] and FFT magnitudes at 10.7 +/– [0.8, 26.7]. CED was able to extract identifiable features from an average of 37.4% of pixels, while the contrast of images was also found to be marginally better, with 53.3% of images presenting adequate contrast levels. Specularities were detected in 54.2% of images, and in 24.9% of images when removing possible outliers and noise.

DS3, being an extension of DS2, shows similar patterns, with an increase in LoG values to 29.1 +/– [13.5, 143.4]; this is nevertheless still far below the desirable threshold of 100. While the median FFT magnitude value is found to be above the desirable threshold of 10, the 95% confidence interval [–0.71, 29.11] indicates that the majority of this sample still falls below what would be required for in-depth analyses of these images. CED values are still below adequate (41.5%), while 56.6% of images present adequate contrast levels. Finally, 58.4% of images present undesirable specularities, with 33.8% of images still presenting these specularities even after noise removal.

Interestingly, when assessing these patterns by taphonomic agents, it becomes evident that the poorest quality of images is generally associated with cut marks across all three datasets. Considering both LoG and FFT values (Table 2), cut marks clearly emerge as the category with the highest number of images out-of-focus. Notably, not a single photograph of a cut mark from DS1 (n = 535) or DS2 (n = 656) exhibits LoG values above 100. In DS3, only 4 images out of 1179 images have LoG values greater than 100. From the perspective of FFT, 44.4% of images from DS1 are considered to be in-focus, 53.4% from DS2, and 41.5% from DS3. Despite this discrepancy in percentage values between LoG and FFT results, it can still be seen how a considerable portion of each of the samples is not of sufficient quality for more advanced CV-based applications. When inspecting images using gradient maps (Figure 2), many images are clearly out-of-focus towards the edge of each image, with only a small portion of the mark being in-focus in the centre of each image. This pattern is likely due to the natural curvature of the bone, combined with the limited focal depth of the microscope being used.

Table 2

Descriptive statistics of the different metrics obtained from analysed images. Descriptive statistics of the different metrics extracted from images from the different classes present in DS1 and DS3; DS2 was excluded from the table as it is contained within DS3. Metrics include the Laplacian of Gaussian (LoG) variance, percentage of image presenting detectable features using Canny Edge Detection (CED), Fast Fourier Transform (FFT) magnitudes, the percentage of images presenting adequate levels of contrast, and the percentage of images presenting complications due to the presence of specularities (Spec.). Descriptive statistics report the central tendency followed by 95% confidence intervals constructed using distribution quantiles. CM = Cut Mark, Croc. = Crocodylian tooth score, TM = Carnivoran Tooth Mark, Tmp = Trampling.

SAMPLE | LOG VARIANCE | FFT MAGNITUDE | CED (%) | CONTRAST (%) | SPEC. (%)
DS1-CM | 20.6 +/– [13.8, 55.6] | 9.1 +/– [0.6, 24.1] | 33.4 +/– [16.3, 56.7] | 45.0 | 11.3
DS1-Croc. | 71.0 +/– [17.8, 230.8] | 19.0 +/– [5.7, 30.7] | 49.9 +/– [26.8, 65.7] | 95.7 | 58.7
DS3-CM | 22.1 +/– [14.1, 70.2] | 8.7 +/– [0.7, 22.9] | 35.7 +/– [16.6, 66.3] | 29.6 | 9.8
DS3-TM | 41.9 +/– [13.6, 133.3] | 16.2 +/– [–0.2, 28.0] | 48.3 +/– [14.6, 68.2] | 80.9 | 62.1
DS3-Tmp | 44.1 +/– [10.2, 378.4] | 9.7 +/– [–6.7, 93.2] | 45.2 +/– [2.7, 87.8] | 47.3 | 40.0
Figure 2

Examples of images displaying exceptionally poor quality. Examples of photographs of cut marks from DS1 and DS3 presenting especially poor image quality, with a considerable portion of pixels out-of-focus towards the image border. Sobel gradient maps in the right-hand panels highlight these features with sharp changes in gradient being clearly visible in the centre of each image, while gradients towards the edges present a high degree of out-of-focus blur with almost no detectable features.

Finally, most photographs of cut marks have insufficient contrast between dark and light areas. Correspondingly, the analysis of extreme variations in brightness to identify the presence of intense specular highlights shows that images of cut marks are in fact the sample least affected by this defect. Conversely, images of tooth marks generally present the highest level of undesirable peaks in light intensity across images (Figure 3).

Figure 3

Examples of photographs presenting specular reflections. Examples of photographs of tooth marks from DS1 and DS3, presenting areas of abnormally intense brightness in certain pixels as a product of specular reflections. Panels on the right present pixels where these abnormalities have been detected.

3.2. Classification Results

In contrast to the results originally published by Domínguez-Rodrigo et al. (2020), Abellán, Baquedano & Domínguez-Rodrigo (2022), and Pizarro-Monzo et al. (2023), most of the algorithms were found to overfit on the data. This discrepancy with the originally published high accuracies (DS1 = 99.0%; DS2 = 92.0%; DS3 = 97.5%) is likely due to the contamination introduced by using the same data for both the validation and testing steps. Code, figures, and evaluation metrics for all of these algorithms have been provided at https://github.com/LACourtenay/Reevaluating_Computer_Vision_Taphonomic_Equifinality.

Three different DL-based CV models were found to perform relatively well on the three different test sets. For DS1, the best performing model was found to be the Jason2 model (Brownlee, 2017), reaching an 89.0% accuracy on the test set with an F1 score of 0.47, AUC of 0.98 (Table 3), and overall RMSE loss of 8.6% (Table 4). It is important to point out, however, that the confidence of this algorithm is strongly conditioned by the imbalanced nature of the dataset, where the loss for predicting crocodylian tooth marks is considerably higher (24.4%) than for cut marks (4.4%). Likewise, it is very important to observe the results of the confusion matrix (Table 5), where the algorithm is evidently classifying all marks as cut marks, regardless of their original label. From this perspective, while the true negative rate is deceptively high (TNR = 1), the true positive rate is 0, implying that the algorithm performs poorly. This leads to a global precision of 0.46 and recall of 0.50, which is far below optimal. Analysis of learning curves (Figure 4) reveals a near-complete lack of fluctuation in validation accuracy values, implying that the algorithm is not learning anything new from the validation data for prolonged periods of time. This is generally a classic sign of insufficient data being provided for validation.

Table 3

Evaluation metrics when predicting the different types of BSMs using the models trained on each dataset and evaluated on the test set. Note that a high accuracy value does not necessarily imply positive performance, as evidenced by the (imbalanced) DS1 dataset.

METRIC | DS1 | DS2 | DS3
Precision | 0.46 | 0.77 | 0.88
Recall | 0.50 | 0.69 | 0.87
F1 | 0.48 | 0.66 | 0.88
Accuracy | 0.92 | 0.86 | 0.91
Table 4

Error rates (in %) when predicting the different types of BSMs using the models trained on each dataset and evaluated on the test sets. Error rates are reported as the RMSE of the labels.

BSM | DS1 | DS2 | DS3
Tooth Score | – | 15.34 | 7.95
Trampling | – | 55.30 | 22.35
Cut Mark | 5.29 | 10.14 | 7.90
Crocodile | 24.41 | – | –
Overall | 8.63 | 15.13 | 9.77
Table 5

Confusion matrix obtained when evaluating the Jason2 model (trained on DS1) on the DS1 test set. Note that the confusion matrix presents a true positive rate of 0; the algorithm classifies all samples as cut marks regardless of whether they are or not.

TRUE \ PREDICTED | CROCODILE | CUT MARK
Crocodile | 0 | 13
Cut Mark | 0 | 146
Figure 4

Empirical learning curves for neural networks. Learning curves obtained from the best performing convolutional neural network architectures on each of the datasets; Jason2 for DS1, VGG16 for DS2, and DenseNet201 for DS3.

For DS2, the VGG16 model (Simonyan & Zisserman, 2015) was the only model capable of learning from the data, yet it presents a poor fit to the training and validation data, as seen in the learning curves (Figure 4). This algorithm reached an 86.0% accuracy, with an F1 score of 0.85 (Table 3). Overall RMSE loss values were reported at 15.1% (Table 4), with the algorithm performing seemingly well at the identification of cut marks, with an 89.9% confidence when making predictions. The algorithm, however, loses confidence (down to 84.7%) when classifying tooth scores, and has an exceedingly high loss (44.7%) when faced with trampling marks (Table 4). Confusion matrices present a relatively high number of false positives and negatives (Table 6), while trampling marks are often mistaken for other traces such as tooth marks (trampling marks: Precision = 0.8, Recall = 0.2). In general, the highest confusion rates involve the tooth score category, with cut marks and trampling marks frequently being misclassified as tooth scores (tooth scores: Precision = 0.5, Recall = 0.9).

Table 6

Confusion matrix obtained when evaluating VGG16 on the test set of DS2 and DenseNet201 on the test set of DS3.

TRUE \ PREDICTED | DS2: CUT MARK | DS2: SCORE | DS2: TRAMPLING | DS3: CUT MARK | DS3: SCORE | DS3: TRAMPLING
Cut Mark | 134 | 11 | 1 | 163 | 2 | 6
Score | 2 | 28 | 0 | 7 | 126 | 3
Trampling | 1 | 13 | 4 | 1 | 11 | 33

Finally, for DS3, the best performing model was the DenseNet201 algorithm (Huang et al., 2017). This algorithm trained the best, with the least amount of overfitting during training/validation; however, it still presents a certain degree of underfitting (Figure 4). This algorithm obtained a 91.0% accuracy on the test set, with an equally high F1 score of 0.88 (Table 3). Nevertheless, loss values in general indicate that the algorithm lacks confidence when classifying trampling marks, as seen by the higher loss rates (22.4%) when compared with cut (8.0%) or tooth marks (7.9%) (Table 4). In general, this is the algorithm with the most stable confusion matrix (Table 6); however, it still presents imbalance in the case of trampling marks. The VGG19 algorithm could also be considered a close contender for this dataset, reaching an 83.0% accuracy. Nevertheless, in this case, the algorithm was unable to overcome limitations imposed by sample imbalance, only being able to correctly identify 4 trampling marks, resulting in a recall value as low as 0.09.

From the perspective of Grad-CAM results, while some activation maps focus on the centre of the image, highlighting the mark itself, a considerable number of images display undesirable results, with the CNN learning from areas of the images that are out-of-focus or that display poor lighting conditions (Figure 5). This is a clear indication that the algorithms are not learning from areas of relevance for the characterisation or identification of BSMs.

Figure 5

Grad-CAM results displaying suboptimal detection of features. Grad-CAM results for a selection of images displaying particularly poor identification of relevant features for BSM classification. The lighter shades of yellow highlight areas where the CNN is identifying notable features. Darker areas leaning more towards blue indicate areas that are not of interest to the CNN when identifying each type of BSM.

4. Discussion

The present study has brought attention to several challenges associated with the proposed application of DL-based CV for the classification of different BSMs. Firstly, empirical evidence highlights the suboptimal quality of the three datasets analysed here. Secondly, upon adopting an appropriate training strategy, with an adequate allocation of training, validation, and test sets, the results obtained are not as promising as originally proposed. Furthermore, issues related to both overfitting and underfitting become apparent when examining the learning curves.

To date, however, few studies have discussed the limitations of DL and how these limitations may affect our own case studies. Some of these limitations will be highlighted below, paying particular attention to those relevant to DL-based CV.

4.1. DL-based CV is only as powerful as the datasets provided

The success of any analysis, from the simplest of statistical analyses, to highly complex use of AI algorithms, is intricately tied to the quality of the data we have. The importance of data quality cannot be overstated, as data is at the core of any field of science. Data is used to model and represent real-world phenomena in a comprehensible format that is ideally free from bias. However, as bias increases, our ability to produce accurate or reliable predictions substantially decreases (Gong et al., 2023). This is most effectively summarised by the popular adage in computer science “garbage in, garbage out” (GIGO) (Srinivasa-Desikan, 2018).

In CV, image quality is a fundamental prerequisite for any type of analysis, requiring that images present an acceptable degree of contrast and sharpness to be able to detect features within them. Lighting also assumes a critical role: variations in illumination are crucial for the detection of texture, while sufficient contrast between dark and light areas of an image directly allows algorithms to detect patterns and structure. In microscopy, the depth of field of the lens is another important component to take into consideration to ensure sufficient image quality. In addressing this aspect, it becomes crucial to ascertain that the entire image is in-focus, using a lens with an appropriate focal distance or by employing focus stacking techniques (Fedorov, Sumengen & Manjunath, 2006; Tian & Chen, 2012; Zhang et al., 2013).

The present study reports several empirical measures revealing that the quality of previously published image datasets is insufficient for many analyses, while disparity exists between the quality of images in different samples. A significant portion of the images exhibit inadequate control of focal distance. Despite claims by the curators of each dataset that lighting conditions were controlled, we document numerous lighting issues. Owing to a lack of details on how lighting conditions were controlled, this casts serious doubts on the reproducibility of the curators’ methodology. In Domínguez-Rodrigo et al. (2020), the authors state that “images were taken in this magnification [30×] using the same light intensity and angle”; however, the current analysis reveals non-uniform lighting conditions, resulting in noise from specular reflections as well as poor image contrast. Images, such as those in Figure 3, exhibit an intense reflection of light in the top left corner, contrasting with the considerably darker rest of the image. If the bone were oriented differently with respect to the light, reflections would occur somewhere else on the object, and thus condition the appearance of the BSMs. In Byeon et al. (2019), the authors describe how histogram equalisation was used to correct these possible issues. However, none of the subsequent studies mention this step again. Likewise, histogram equalisation is not included in the workflow presented in any of the published code. Nevertheless, such image corrections do not necessarily solve the problem if the images are of poor quality to begin with.

Considering the limited area in-focus in each image, the present study shows with converging evidence that training DL-based CV models on these BSM photographs poses significant challenges. Image quality analyses are inherently difficult: quality estimates are sensitive to the scene or element depicted (Mittal, Moorthy & Bovik, 2012; Wang, Sheikh & Bovik, 2002), as well as to the imaging modality. This explains the discrepancies between LoG and FFT when discerning whether an image is in-focus or not, due to the way in which pixel gradients are calculated and evaluated, and the fact that these metrics have not necessarily been designed for the particular task of evaluating images obtained with a microscope. Nevertheless, regardless of the metric used, a considerable portion of the samples has been shown to be problematic (>44.4%), while simple visual inspection of the images reveals the datasets to be questionable.

Problems related with focal depth are particularly noteworthy for the cut mark samples, identified as the most problematic in this regard. Here, the GIGO adage is perhaps more relevant than ever, as DL algorithms may well be learning that cut marks are blurry and crocodylian tooth marks are not, rather than focusing on features relevant to cut mark analysis. From this perspective, it is interesting to point out that algorithms presented by Domínguez-Rodrigo et al. (2020) predominantly classed the Dikika (Ethiopia) marks as trampling marks, while the few images from Anjohibe, Itampolo, and Christmas River (Madagascar) presented low classification probabilities of being cut marks. While the current authors refrain from expressing an opinion on the validity of claims regarding these marks as cut marks, it is crucial to highlight that the data presented by other authors concerning their classification using DL-based CV is insufficient to draw conclusions.

To develop this point further, it is important to additionally address the need not only for high quality images, but for a more homogeneous standard of images across the entire dataset. This would help algorithms avoid honing in on differences between samples that are due to differences in image quality, as opposed to what the images actually contain. Our results clearly show that some of the samples have different properties than others, based almost entirely on how the images were acquired. CNNs are well known for making decisions based on differences between images that are not relevant to the task at hand. A famous example is that of CNNs classifying images of dogs and wolves based on whether snow is present in the background, rather than on the animal itself (Ibrahim and Shafiq, 2023), alongside a number of other unpublished anecdotal examples. From this perspective, the acquisition of data needs to ensure that there is no bias in quality across samples, beyond the evident need for high quality data on a global scale.

While attempts have been made to develop algorithms that are able to successfully learn from poor quality images (e.g., Zou & Yuen, 2012; Koziarski and Cyganek, 2018), these studies are only designed to confront very particular research questions where obtaining higher quality data is difficult or impossible. Nevertheless, many of these approaches still present difficulties, and remain far from perfect (Koziarski and Cyganek, 2018). When good quality data can be obtained, as is the case when photographing BSMs, all efforts must be made to ensure that the data acquisition protocol is as rigorous as possible, and that the resulting images meet the appropriate quality assurance thresholds.

Finally, we must point out that the most recent publication by Domínguez-Rodrigo et al. (2024), which appeared after the first version of the present study, is the first to confront the issue of image quality, using microscopes configured to obtain images that are entirely in-focus. From this perspective, some efforts have been made to obtain higher quality images; however, this does not fix the issues with the previous studies by the same authors, especially those applied to archaeological data. Likewise, Domínguez-Rodrigo et al. (2024) still presents issues with the code (see Supplementary File 1), and a number of other fundamental theoretical questions that will be discussed below.

4.2. Not so intelligent artificial intelligence

In recent years, DL models have grown in popularity across all fields of science, with studies indicating that, in some cases, AI algorithms are able to perform to a similar level as humans at certain tasks (Esteva et al., 2017), or claims that they may even be better (Domínguez-Rodrigo et al., 2020). Nevertheless, the field of DL is constantly growing and research increasingly shows that algorithms are still far from perfect. From this perspective, many fields remain reticent about allowing DL models to be employed in real world situations and professional practices, particularly in biomedical research (Roberts et al., 2021). Research in DL has shown that CNNs are highly sensitive to the data provided as input (Szegedy et al., 2014; Goodfellow, Schlens & Szegedy, 2015; Dodge & Karam, 2016; Su, Vargas & Sakurai, 2019). Neural networks (NN) are an attempt to use mathematical and computational resources to replicate the functionality of a biological brain, and thus make informed decisions based on data to which they are exposed. The issue is, however, that to date these algorithms are still inherently flawed, lacking fundamental natural cognitive attributes such as common sense (Vedantam et al., 2015; George et al., 2017), true generalised knowledge across different tasks (Ben-David et al., 2010; Finn, Abbeel & Levine, 2017), combined multimodal knowledge across multiple domains (Vedantam et al., 2015; Vinyals et al., 2015; Agrawal et al., 2016), leveraging of prior knowledge (Fink, 2004; Santoro et al., 2016), foresight (Finn & Levine, 2017), and the ability to say “I don’t know” when confronted with entirely alien information (Nguyen, Yosinski & Clune, 2015; Dhamija, Günther & Boult, 2018). These attributes form a fundamental component of biological cognition and conditions how we make decisions in our everyday lives.

In the case of modelling from multivariate numeric data, such as measurements, training requires algorithms to be aware of the underlying statistical properties of the data: high overlap between complex multivariate and non-parametric distributions implicate a need for increasingly more complex algorithms to model and make sense of this information (Diakonikolas et al., 2017; 2019). This challenge is particularly pertinent in the context of DL-based CV, where models must exhibit invariance to the translation, scale, and orientation of the features and elements they are trained to detect. Taphonomists have attempted to tackle this issue by using data augmentation (Cifuentes-Alcobendas & Domínguez-Rodrigo, 2019; 2021; Domínguez-Rodrigo et al., 2020; Jiménez-García et al., 2020a; Pizarro-Monzo & Domínguez-Rodrigo, 2020; Abellán et al., 2021; Abellán, Baquedano & Domínguez-Rodrigo, 2022; Domínguez-Rodrigo et al., 2024), transforming images slightly to enhance the model’s capacity to detect these features. Nevertheless, this process has its limitations as it only exposes the model to familiar variation. Another consideration within this domain is the influence of visual stimuli like colour and light. While conversion of images to 8-bit greyscale images with 3 channels (R=G=B) is a useful means of reducing the complexity of information to learn from, pixel intensity remains inherently linked to light conditions, complicating complete elimination of this dependency.

Arguably, for studies when algorithms are trained from scratch, conversion of the images to single channel greyscale images would be a much more efficient means of reducing dimensionality, as well as computational complexity. To clarify this point, if we were to convert images from DS1 to 8-bit single-channel greyscale images, then the first layer of the Jason2 architecture would only have 320 parameters to learn, as opposed to the 896 parameters involved in learning from three-channel images. Regardless of whether the R=G=B property of these images implies a reduced numerical complexity to the dataset, with two channels containing redundant information, the algorithms still have to learn from these redundancies. Nevertheless, conversion to single channel images would remove the possibility of performing transfer learning from ImageNet, which was originally trained on RGB images, where R≠G≠B. While the original authors of the datasets could have carried out their conversions and analyses on these modified images solely for the purpose of being able to perform transfer learning, no justification or explanation was provided in the original publications of these methods. Moreover, from this perspective, it is important to point out that no information has been provided by the original authors as to how greyscale conversion was carried out, making the pooling of these datasets with other datasets by other authors a bit more difficult and potentially problematic. The original authors should either clarify how this pre-processing was carried out, or directly provide the original images so that future users can carry out pre-processing steps themselves (more detailed notes on this are also available on the associated GitHub page of this study: https://github.com/LACourtenay/Reevaluating_Computer_Vision_Taphonomic_Equifinality).
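This difference in parameter counts can be verified directly, as in the sketch below, which assumes (as the figures of 320 and 896 imply) a first layer of 32 filters of size 3 × 3.

```python
# Verifying the parameter counts mentioned above, assuming a first
# convolutional layer of 32 filters of size 3 x 3.
import tensorflow as tf

conv_rgb = tf.keras.layers.Conv2D(32, (3, 3))
conv_rgb.build((None, 80, 400, 3))   # three-channel input
conv_grey = tf.keras.layers.Conv2D(32, (3, 3))
conv_grey.build((None, 80, 400, 1))  # single-channel input

print(conv_rgb.count_params())   # 32 * (3*3*3 + 1) = 896
print(conv_grey.count_params())  # 32 * (3*3*1 + 1) = 320
```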

In the context of BSM analyses, it is important to consider that we are generally working with microscopes, each possessing unique optical properties. For example, the resolution of cameras can strongly differ on different microscopes. Here it could be argued, therefore, that comparing images obtained using different microscopes and sensors could possibly be problematic (Kothari et al., 2014; Fortin et al., 2018; Da Rin et al., 2022), a hypothesis that is yet to be tested or validated for taphonomic research. In the past, CNNs have been shown to be able to detect the make and model of the camera being used to take images (e.g., Tuama, Comby & Chaumont, 2016; Athanasiadou, Geradts & Eijk, 2018; Kuzin et al., 2018), implying that nuances in camera hardware and optical characteristics might introduce subtle artefacts into the captured images which can be detected by DL models. Likewise, understanding the internal parameters of cameras is a fundamental step and component of CV and photogrammetry (González-Aguilera, 2005; Li & Liu, 2018). Considering both perspectives, it becomes evident that the specific imaging technique could influence the numerical properties of each image. While these discrepancies might not be crucial when comparing two different macro cameras, microscopes generally exhibit substantial variability in image quality, resolution, and data type (e.g., Chen, Zheng & Liu, 2011; Leach, 2011; Borel et al., 2014; Arman et al., 2016; Calandra et al., 2019; Martín-Viveros & Ollé, 2020).

In Pizarro-Monzo et al. (2023), the authors describe that documentation of archaeological traces was performed using both confocal and digital microscopy, yet the documentation of the experimental datasets employed optical microscopes (Byeon et al., 2019). Likewise, Moclán et al. (2024) used digital microscopy to photograph the archaeological samples used for classification in their study. Finally, Domínguez-Rodrigo et al. (2020) published images taken using a number of different techniques from other archaeological and palaeontological sites to be used as input into algorithms. Until empirical investigations are conducted to determine the potential impact of imaging modalities on prediction outcomes, any results obtained by comparing a DL algorithm trained on experimental samples from one microscope type with palaeontological or archaeological traces documented using another must be approached with extreme caution or completely discarded.

AI algorithms have often been labelled as ‘black box’ models due to their intricate and complex internal mechanisms (Lent, Fisher & Mancuso, 2004; Abadi & Berrada, 2018), obscuring a direct comprehension of the way input data are converted into output predictions or decisions. This lack of transparency raises concerns about the interpretability of these algorithms and the potential biases that could exist within them. To address this challenge, a popular technique in DL-based CV is gradient-weighted class activation mapping (Grad-CAM). This technique harnesses intermediate layers within a trained neural network to produce activation maps, effectively pinpointing the specific areas of the image that contribute to the model’s final output (Selvaraju et al., 2017; 2019). Certain studies have used Grad-CAM to demonstrate that DL-based CV algorithms function as expected (Cifuentes-Alcobendas & Domínguez-Rodrigo, 2019; Jiménez-García et al., 2020a; Abellán et al., 2021; Domínguez-Rodrigo et al., 2021c; Cobo-Sánchez et al., 2022), as gradient mappings highlight relevant features such as the shoulders and inner striations of marks. However, here we demonstrate that Grad-CAM only emphasises these areas for specific examples within the test set, raising concerns about potential bias in the selection of the Grad-CAM examples being showcased.

4.3. Some key notes on sample size and the transferability of learned features in CNNs

Throughout this discussion, several limitations and challenges have been presented regarding the field of DL, and more importantly DL-based CV, with one general consensus: CNNs are highly sensitive to the numeric values used as input (Carlini & Wagner, 2017; Kurakin, Goodfellow & Bengio, 2017; Madry et al., 2019), as they are unable to extrapolate beyond the conditions represented in their training data. Some research has even shown that modifying the values of just a single pixel is enough to cause CNNs to fail (Szegedy et al., 2014; Nguyen, Yosinski & Clune, 2015; Su, Vargas & Sakurai, 2019). We can fairly assume that CNNs trained on experimental taphonomic data are not robust enough to aptly distinguish the true natural variability of BSM appearance, which will become increasingly apparent as we include the progressively overlapping nature of the taphonomic record. This evidently includes the distortion of BSM appearance caused by biostratinomic, diagenetic, and trephic processes (Pineda et al., 2014; 2019; 2023; Gümrükçu & Pante, 2018; Valtierra, Courtenay & López-Polín, 2020), or by taphonomic contexts that condition the texture and appearance of cortical surfaces, such as manganese or oxide staining.

This is not to say that DL-based CV cannot obtain high accuracies for palaeontological and archaeological applications in the future. Before this can happen, however, a critical look at the samples being used is needed, especially in terms of sample size, quality, and balance.

AI applications often demand substantial datasets, particularly in the case of neural networks and CNNs. For instance, the pioneering work by LeCun et al. (1998) achieved >99% accuracy using a dataset of 60,000 monochrome 28 × 28-pixel training images (MNIST dataset). From 2009 to 2012, significant strides in AI were made using two expansive datasets: one containing 60,000 coloured 32 × 32-pixel images (the CIFAR-10 dataset) and another featuring 15 million coloured 469 × 387-pixel images (the ImageNet dataset). Despite the challenges faced by authors in reaching an 80% accuracy mark in image classification in many instances (Deng et al., 2009; 2010; Krizhevsky & Hinton, 2009; Krizhevsky, 2010; Krizhevsky, Sutskever & Hinton, 2012), these studies remain among the most pivotal advancements in DL-based CV history. Notably, Szegedy et al. (2015) introduced innovations in CNN architectures that achieved 90% classification accuracy using 1.35 million images from the ImageNet dataset. Similarly, Redmon et al. (2016) surpassed 90% accuracy in object detection and recognition by employing 9 million coloured video frames from the ImageNet10k dataset.

For numerical quantitative analyses, calculating an optimal sample size is typically approached by reducing the variable-to-individual ratio as much as possible. Unlike tabular datasets with a fixed number of columns representing features, images are inherently more complex due to their high-dimensional nature. While CV-based applications do typically use raw pixel values as input, they do not process them independently, and, therefore, the number of pixels is not necessarily a relevant measure of the number of variables. CNNs instead move across the image using multiple filters to learn hierarchical features that are then passed on to other layers that make sense of these features (Bengio, 2012; LeCun, Bengio & Hinton, 2015; Goodfellow, Bengio & Courville, 2016). Consequently, it is the number of parameters (i.e., the network’s learnable weights) that contributes to model complexity, and the focus of a study should therefore be on ensuring that models generalise well to new, unseen images given these parameters.
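
To illustrate why pixel counts are not the relevant measure of dimensionality, consider the following minimal Keras sketch (an illustration written for this discussion, not code from the studies under review): a single 3 × 3 convolution with 64 filters over a three-channel image always has (3 × 3 × 3 × 64) + 64 = 1,792 learnable weights, regardless of the resolution of the images passed through it.

```python
import tensorflow as tf

# One 3 x 3 convolutional layer with 64 filters applied to three-channel images
# of arbitrary resolution; the parameter count does not depend on image size.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(None, None, 3)),
    tf.keras.layers.Conv2D(64, (3, 3)),
])
print(model.count_params())  # 1792 = (3 * 3 * 3 * 64) weights + 64 biases
```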

The present study has shown three algorithms to be the top-performing models: Jason2 presents the lowest number of learnable parameters at ≈ 3.4 million, followed by VGG16 with ≈ 16.3 million, and finally DenseNet201 with ≈ 28.4 million parameters. Even when presented with datasets of over 500 images, the number of parameters far exceeds the number of images available to learn from, and most algorithms do not generalise well to new data (i.e., the test sets).

Evidently, when trained from scratch, each of these algorithms has an excessive number of parameters given the size of the datasets at hand. A means of overcoming this issue is to use transfer learning, where the models are originally trained on a much larger dataset and then parameters are fine-tuned to our own smaller dataset. When assessing previously published code, it can be seen how previous authors performed this task by freezing all the convolutional layers of each algorithm (Domínguez-Rodrigo et al., 2020; Pizarro-Monzo et al., 2023). In the case of VGG16 and DenseNet201, this drops the number of learnable parameters to ≈ 1.6 million (9.8% of the initial number) and ≈ 10.3 million (36.3%), respectively. The present study, following the instructions of Domínguez-Rodrigo et al. (2020) and Pizarro-Monzo et al. (2023), carried out transfer learning by using the already existing weights of each of these algorithms when trained on the ImageNet dataset. Nevertheless, there are some important aspects of the ImageNet dataset that have to be considered when using transfer learning in BSM applications.
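
As a point of reference, the following sketch reproduces the general logic of this approach in Keras: ImageNet weights are loaded, the entire convolutional base is frozen, and only a small classification head is trained. The head shown here is an illustrative assumption (chosen so that the trainable count is of the same order as the ≈ 1.6 million quoted above for VGG16), not necessarily the head used in the published code.

```python
import tensorflow as tf

# VGG16 convolutional base pre-trained on ImageNet, without its original classifier
base = tf.keras.applications.VGG16(weights="imagenet", include_top=False,
                                   input_shape=(224, 224, 3))
base.trainable = False  # freeze every convolutional layer, as in the published workflows

# Hypothetical classification head trained on the BSM images
model = tf.keras.Sequential([
    base,
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(2, activation="softmax"),  # e.g., cut mark vs. trampling mark
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()  # reports trainable vs. non-trainable parameter counts
```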

The ImageNet dataset consists of approximately 15 million colour images of over 20,000 different objects taken with standard cameras. While this training set is substantially larger than any BSM dataset currently available, transfer learning works best when the source and target classification tasks share some level of similarity or underlying structure (Raghu et al., 2019; Ray, Raipuria & Singhal, 2022). When the tasks are related, the knowledge acquired from the source task is more likely to be relevant and beneficial for the target task (He, Girshick & Dollár, 2018; Kornblith, Shlens & Le, 2018). Essentially, features learned from the source data can capture common patterns, representations, or concepts that might also be present in the target data. This shared knowledge can then act as a foundation for the model to generalise more effectively to the target task.

Clearly, the ImageNet dataset differs significantly from current BSM datasets. ImageNet contains classes such as ‘minivan’, ‘balloon’, ‘microphone’, and ‘syringe’, which represent a considerable domain gap relative to taphonomic alterations on bone imaged using an optical microscope. Likewise, task complexity differs considerably between ImageNet and BSMs: equifinality is not a potential source of confusion between minivans and balloons, but it does affect the distinction between cut marks and trampling marks. This is not to say that transfer learning is not relevant for taphonomic research (He, Girshick & Dollár, 2018), but it cannot be used as a substitute for considerably larger sample sizes. For this reason, it is important to examine the precise mechanisms behind transfer learning, and more experimentation is required to ensure that the freezing of all layers is actually useful.

When discussing transfer learning in the context of CNNs, it is essential to understand layer hierarchy. The first few layers of a CNN are typically designed to capture low-level features, such as edges and texture. As we progress into the deeper layers of the CNN, we begin to find filters that are capable of detecting more complex patterns, eventually reaching the deepest layers, which are optimal for learning object representations (Bengio, 2012; Yosinski et al., 2014; LeCun, Bengio & Hinton, 2015; Goodfellow, Bengio & Courville, 2016). As evidenced in previously published code, the current use of transfer learning in BSM applications freezes all the layers, even those that may be irrelevant to the identification of precise features in cut, trampling, or tooth mark morphology. From this perspective, it is essential to investigate the utility of these deeper layers. Rather than freezing all layers, which retain potentially irrelevant information for the task at hand, one might consider freezing only a certain number of layers in each CNN (Raghu et al., 2019). Experimentation should at least be done to ensure that the chosen approach is the most valid for the task at hand.
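
A partial-freezing strategy requires only a small change to the sketch presented above: the early, generic layers are kept fixed while the deeper layers are allowed to adapt to BSM data. The cut-off value below is purely illustrative and is precisely the kind of hyperparameter that would need to be tested experimentally.

```python
# Continuing the previous sketch: freeze only the earliest layers of the
# convolutional base (low-level edge and texture filters) and leave the deeper,
# more task-specific layers trainable.
n_frozen = 7  # illustrative cut-off: roughly the first two VGG16 convolutional blocks
base.trainable = True
for layer in base.layers[:n_frozen]:
    layer.trainable = False
model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),  # lower learning rate for fine-tuning
              loss="categorical_crossentropy", metrics=["accuracy"])
```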

In light of previous statements in section 4.2, it is also important to consider that ImageNet contains coloured images, and the original weights of the algorithms are therefore adapted to three channels with varying degrees of complexity. The conversion of images to 8-bit three-channel greyscale results in all three channels containing exactly the same values for each pixel, a configuration to which the original ImageNet-derived weights were most certainly never exposed. While the first few layers, focusing on edges and texture, may not be fully dependent on the presence or absence of colour, it is important to consider that other layers that the authors freeze may be more sensitive to such variability.
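
The effect described here is easy to verify: converting an image to greyscale and replicating it across three channels produces an array in which the channels are numerically identical, in contrast to the colour statistics of the ImageNet images used to derive the frozen weights. A minimal check (using a placeholder filename) might look as follows.

```python
import numpy as np
from PIL import Image

img = Image.open("mark_image.png").convert("L")   # 8-bit single-channel greyscale
arr = np.stack([np.asarray(img)] * 3, axis=-1)    # replicate into three identical channels
print(arr.shape, arr.dtype)                       # (H, W, 3) uint8
print(np.array_equal(arr[..., 0], arr[..., 1]),   # True: the channels carry no
      np.array_equal(arr[..., 1], arr[..., 2]))   # independent colour information
```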

Sample balance is also a fundamental consideration. Sample imbalance arises when there is an unequal distribution of examples among a set of known classes (He & Ma, 2013). Using the formulae described in Courtenay (2023) to define a Balance (B) index between 0 (imbalance) and 1 (balance), the three datasets considered in our study range from moderate (DS2: B = 0.67) to severe (DS1: B = 0.42) imbalance. Jiménez-García et al. (2020a) originally argued that sample imbalance was not an issue for their very small and largely imbalanced dataset (lion tooth marks: n = 209, jaguar tooth marks: n = 42; B = 0.65), yet later provided a corrigendum to retract this statement and presented ensemble learning as a solution (Jiménez-García et al., 2020b). Here we demonstrate how sensitive DL algorithms are to imbalance, with the smaller classes always presenting the highest loss rates, and with confusion matrices showing that algorithms are also less likely to predict the smaller class. Likewise, rebalancing is still an open line of research that requires a deeper understanding of latent features, especially in the case of CNNs (Dablain et al., 2023). When considering ensemble learning as a potential solution to this issue, most architectures tested here were found to overfit even on the largest dataset (DS3), providing a limited number of algorithms to stack. Similarly, the inappropriate evaluation of algorithms on the same data used for validation in the original studies renders the reliability of such solutions questionable. While ensemble learning approaches can often be much more efficient than single neural networks (Shanmugavel et al., 2023), it is generally recommended that the base learners of an ensemble be as accurate as possible to begin with.
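
For reference, a normalised Shannon-entropy formulation of class balance, assumed here to correspond to the index described in Courtenay (2023), reproduces the B ≈ 0.65 value reported for the lion/jaguar tooth mark dataset:

```python
import numpy as np

def balance_index(counts):
    """Normalised Shannon entropy: 1 = perfectly balanced classes, 0 = a single class."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    h = -np.sum(p * np.log(p))   # Shannon entropy of the class proportions
    return h / np.log(len(p))    # divide by the maximum possible entropy, log(k)

print(round(balance_index([209, 42]), 2))  # 0.65 (lion vs. jaguar tooth marks)
```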

It is additionally important to point out a number of theoretical issues with the use and implementation of ensemble learning proposed by previous authors (Domínguez-Rodrigo et al., 2020; 2024; Jiménez-García et al., 2020b; Abellán et al., 2021; Vegara-Riquelme et al., 2023), which have also been highlighted in our annotation of some of the published code (Supplementary File 1) and in further documentation in Supplementary File 2. It must be emphasised that, while ensemble learning approaches are an extremely powerful means of overcoming multiple limitations that single models may face, they have their own intricacies and require specific training strategies to be implemented and evaluated correctly. In each of the applied taphonomic case studies, the authors state that they use stacking techniques to implement ensemble learning.

Stacking involves the combination of individual learners, known as first-level learners, whose outputs are used as input to a second-level learner that makes an aggregated final prediction (Wolpert, 1992). This process requires additional splitting of the data used for training and evaluation, so as to truly avoid overfitting and properly evaluate ensemble results (Zhou, 2012). It is clearly stated in the literature that stacking requires the first-level learners (in this case CNNs) to be trained on data separate from those used to train the second-level learner (in the case of Domínguez-Rodrigo et al., 2020, a random forest algorithm). This, in turn, requires that testing and evaluation be performed on an additional test set, while validation is still required to optimally train the first-level learners, resulting in four distinct splits of the input data. None of this was actually carried out by the aforementioned authors, meaning that we cannot truly assess the accuracy of the ensemble learning methods proposed. In Supplementary File 2, we carry out this analysis through a correct implementation of the ensemble learning workflow, and find that while Domínguez-Rodrigo et al. (2020) published a 90% accuracy on their data, our experiments yield accuracies ranging between 17% and 91%, and F1 scores between 0.1 and 0.91, depending on how the training is performed. Needless to say, recall values for the minority samples are far below the optimal threshold (≈ 0.3 to 0.5), even for instances where the algorithms reach ≈ 90% accuracy, while we still cannot be sure that the algorithms are not simply distinguishing blurry images from non-blurry images.
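
To make the required data-splitting explicit, the following sketch outlines a correctly nested stacking workflow using scikit-learn; the image array, label array, and train_cnn helper are hypothetical placeholders standing in for the CNN training routines, not functions taken from the published code or from our repository.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# images, labels and train_cnn() are hypothetical placeholders.

# (1) Hold out a final test set that no learner ever sees during training.
X_rest, X_test, y_rest, y_test = train_test_split(
    images, labels, test_size=0.2, stratify=labels, random_state=0)

# (2) Separate the data used to train the second-level learner from the data
#     used to train the first-level CNNs.
X_lvl1, X_lvl2, y_lvl1, y_lvl2 = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0)

# (3) Within the first-level portion, keep a validation set for the CNNs,
#     giving four distinct splits in total (train, validation, level-2 train, test).
X_train, X_val, y_train, y_val = train_test_split(
    X_lvl1, y_lvl1, test_size=0.2, stratify=y_lvl1, random_state=0)

cnns = [train_cnn(X_train, y_train, X_val, y_val) for _ in range(3)]  # first-level learners

# The second-level learner (here a random forest) is trained on the CNNs'
# predicted class probabilities for data the CNNs were never trained on.
meta_features = np.hstack([cnn.predict(X_lvl2) for cnn in cnns])
meta_learner = RandomForestClassifier(random_state=0).fit(meta_features, y_lvl2)

# Only now is the ensemble evaluated, on the untouched test set.
test_features = np.hstack([cnn.predict(X_test) for cnn in cnns])
print("ensemble accuracy:", meta_learner.score(test_features, y_test))
```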

In summary, DL-based CV must overcome several critical issues before the algorithms can be considered robust enough to efficiently distinguish variation in BSMs. Many of these limitations can only be addressed by obtaining considerably larger sample sizes. The issues discussed thus far, however, are not limited to DL-based CV analyses. From a self-critical perspective, the use of neural networks for processing geometric morphometric data in taphonomy (e.g., Courtenay et al., 2021) is likely to be affected by similar problems, especially when sample sizes are even smaller than those described here. Ultimately, neither method is inherently superior, as both are susceptible to the limitations inherent in DL research, and much more work is needed. Further work ensuring that DL algorithms are more robust and trained with larger sample sizes will thus be a valuable contribution to taphonomic research, regardless of the type of data used (i.e., images or landmarks).

5. Conclusions

The present research has raised a number of issues with the use of DL-based CV for taphonomic research. First, we have shown that the image quality of published datasets is far from optimal for most types of analyses, and that a more rigorous and reproducible description and implementation of the methodological workflow is required. In addition, we have thoroughly re-examined the training of CNNs and attempted to replicate the results, revealing previously published high classification rates to be overly optimistic and a product of the inadequate training of algorithms. Finally, we have drawn attention to the fact that, in order for ensemble learning to overcome many of the issues discussed, a more appropriate training and evaluation strategy is required to ensure that overfitting has not occurred.

Equifinality is clearly a problem in taphonomy, and research into the use of DL is highly valuable to clear up doubts that could arise about a number of sites. Notable applications would include questions regarding sites such as Dikika (McPherron et al., 2010) and Quranwala (Malassé et al., 2016). Nevertheless, considering the nature of the taphonomic record and the fragility of CNNs when exposed to virtually imperceptible noisy inputs, it is likely that algorithms will easily misclassify actual palaeontological and archaeological specimens if overlying taphonomic processes influence BSM appearance.

It is important to point out that the take-home message of the present research is not to say that DL-based CV cannot be applied to palaeontological and archaeological case studies, but rather that it should not be applied until some fundamental concerns are resolved. The present study therefore wishes to promote the following points, in support of previously published criticisms (Calder et al., 2022; McPherron et al., 2022; Yezzi-Woodley et al., 2024), as guidelines for future publications:

  • The quality of any dataset used for computational learning applications is paramount, and datasets must be curated and kept to the highest standards.

  • We strongly suggest that the review process prior to publication of this type of research be much more rigorous, especially concerning the quality of datasets that are published.

  • More transparency and reproducibility are required when publishing this type of research. This includes the fundamental requirement that all code be provided in its entirety, which can be extended to the publication of all datasets as well. It is also important that authors either publish the original images, without preprocessing, or clearly state how preprocessing (such as greyscale conversion) was performed, to ensure replicability and reproducibility.

  • Datasets must be larger, of higher quality, more balanced, and analogous with what will actually occur in the palaeontological and archaeological record.

  • Experiments must be run to assess the validity of comparing images obtained using different imaging techniques and different microscopes. Likewise, authors should be transparent about the precise properties of the microscope or camera used.

  • Training, validation, and test sets must fulfil the requirements of best DL practices: the test set cannot be used for both validation and testing. If ensemble learning is being performed, then an additional split is required to produce a training set for the definition of the second-level algorithm.

  • If Grad-CAM is used to justify the functionality of an algorithm, then all instances from the test set must be included, not just a selection. Cherry-picking should not be considered a means of scientifically proving a concept.

  • Care must be taken when assuming that transfer learning will solve all sample size-related problems due to limitations in domain crossover. Furthermore, experimentation should be carried out to assess the reliability of freezing different numbers of layers during this process.

  • When dealing with sample imbalance, appropriate evaluation metrics must be employed (He & Ma, 2013), and careful inspection of confusion matrices is required. This is fundamental considering how often simple reports of algorithm accuracy are misleading.

In sum, we acknowledge that future palaeontological and archaeological research may benefit from the application of developments in computer science to improve objective, replicable interpretations of central issues relating to our evolutionary history. However, to achieve this admirable goal, we must ensure complete transparency in both the materials used and the methods implemented when adapting novel analytical approaches, or else we risk discrediting such commendable scientific endeavours altogether.

Data Accessibility Statement

All datasets used within this study are publicly available and have been referenced and described in detail in the main body of the text. All code is available from the corresponding author’s GitHub page: https://github.com/LACourtenay/Reevaluating_Computer_Vision_Taphonomic_Equifinality. Any additional information or figures can be consulted in the supplementary files, or on the online repository https://doi.org/10.6084/m9.figshare.24877743.v3.

Additional Files

The additional files for this article can be found as follows:

Supplementary Materials

Supplementary File 1. DOI: https://doi.org/10.5334/jcaa.145.s1

Supplementary Materials

Supplementary File 2. DOI: https://doi.org/10.5334/jcaa.145.s2

Acknowledgements

We would like to thank the Bone Surface Modifications working group of the Taphonomy European Network (TaphEN), alongside Francesco Boschin, Jean-Philip Brugal, and Philippe Fosse. We are also very grateful to the three anonymous reviewers for their very helpful and constructive comments over four rounds of review. We sincerely hope that, in the future, other papers on this topic will be subject to such an exhaustive review process, ensuring the highest possible quality of archaeological science.

We would also like to thank Olivier Saut for his helpful comments and suggestions regarding certain aspects of this paper.

Funding Information

This research benefited from financial support by the following agencies: ERC Synergy QUANTA (grant # 951388) (LAC); IdEx “Talent” programme by the University of Bordeaux (grant # 191022-001) (LD); French government in the framework of the University of Bordeaux’s IdEx “Investments for the Future” program/GPR “Human Past” (LAC, AS, LD); ERC Starting Grant ExOsTech (grant # 101161065) (LD); National Scientific Research Centre (CNRS) of France through the 80 prime program within the framework of the project PHECOPAAD (AS).

Competing Interests

The authors have no competing interests to declare.

Author Contributions

L.A.C. Conceptualization, Methodology, Software, Formal Analysis, Investigation, Writing – original draft, review and editing, Visualization, Project Administration.

N.V. Conceptualization, Validation, Writing – review and editing.

L.D. Validation, Writing – review and editing.

A.S. Conceptualization, Validation, Writing – review and editing, Project Administration.

DOI: https://doi.org/10.5334/jcaa.145 | Journal eISSN: 2514-8362
Language: English
Submitted on: Dec 20, 2023
Accepted on: Oct 10, 2024
Published on: Dec 10, 2024

© 2024 Lloyd Austin Courtenay, Nicolas Vanderesse, Luc Doyon, Antoine Souron, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.