Machine Learning Applications in Archaeological Practices: A Review

1. Introduction

A study by Binford and Binford (1966) on the characterisation and classification of Mousterian assemblages was one of the first steps into multivariate statistical analysis in archaeological research, and opened a window for further, more developed applications. Other major developments in statistical and computing analyses soon followed (Djindjian 2015, p. 66), dealing with classification and clustering problems (Hodson 1970), predictive archaeology (Judge and Sebastian 1988), or various quantitative techniques to fully bring together data structures, quantitative analyses and theoretical interpretations (Carr 1989; Voorrips 1990). The first traces of machine learning applications in archaeology date back to the 1970s (Kowalski et al. 1972; Thomas 1973), and their extensive use has grown with advancements in computing technology (Cacciari and Pocobelli 2022; Cardarelli 2025).

Unlike traditional statistical models, such as ordinary least squares or other linear and general linear models that find solutions through formal equations, machine learning methods use the data itself to create a model able to accurately evaluate new, unseen data. Machine learning is defined by Alpaydin as “programming computers to optimize a performance criterion using example data or past experience” (Alpaydin 2014, p. 3). Machine learning models require a training phase, in which the computer learns from the data to improve its model and thus its predictions. Once trained, machine learning models often yield faster, more accurate results than traditional methods, while also lowering costs (e.g. Kochkov et al. 2024). This benefit can make machine learning an appealing solution to certain scientific questions (Verhagen 2007; Hansen and Nebel 2020; Calder et al. 2022; Brandsen and Koole 2022; Orellana Figueroa et al. in press). However, this comes at the cost of the models being difficult to interpret (Carvalho et al. 2019) and of the training process generally requiring large amounts of (high-quality) data, time, and compute power; in effect, the time and costs are front-loaded onto the training process in order to produce fast and accurate predictions once trained (Sevilla et al. 2022).
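To make this contrast concrete, the following is a minimal, purely illustrative sketch of the train-then-predict workflow, using scikit-learn and synthetic data; the model, data, and parameters are hypothetical and not taken from any of the reviewed studies.

```python
# Minimal sketch of the machine learning workflow: learn a model from example data,
# then evaluate it on unseen data. Synthetic data stands in for real measurements.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic "example data" with two classes
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Hold out a portion of the data to estimate how the trained model generalises
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)        # training phase: optimise on example data
y_pred = model.predict(X_test)     # prediction on new, unseen data
print(f"Accuracy on unseen data: {accuracy_score(y_test, y_pred):.2f}")
```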

In archaeology, machine learning includes a variety of applications such as the classification of zooarchaeological remains (Boon et al. 2009; Anichini et al. 2021; Cole et al. 2022; Anglisano et al. 2022), spatial pattern and mobility analyses (Stott et al. 2019; Štular et al. 2022), cultural heritage reconstruction and preservation (Toler-Franklin et al. 2010; Grilli and Remondino 2019; Castiello 2022; Parsons 2023), the study of settlement dynamics (Miera et al. 2022), taphonomic classification (Byeon et al. 2019) and on-site analysis of the origin and function of sediments and the spatial distribution of artefacts (Orengo and Garcia-Molsosa 2019; Ginau et al. 2020; Reese 2021; Agapiou et al. 2021).

The frequency, extent, and success of machine learning applications across archaeology remain largely unknown, and their implications and ethical considerations are under debate (Davis 2020a; Bickler 2021; Tenzer et al. 2024; Gattiglia 2025; Neri and Dadà 2025). Previous reviews of machine learning applications in archaeology mainly focused on remote sensing (Jamil et al. 2022; Argyrou and Agapiou 2022), text analysis (Gonzalez-Perez et al. 2023), classification of ceramics (Ling et al. 2024), use-wear analysis (Eleftheriadou et al. 2025) and other artefacts (Naso and Sciuto 2025), site conservation (Casillo et al. 2025), or only on a small set of study cases (Mantovan and Nanni 2020). On the other hand, while Cacciari and Pocobelli (2022) have assessed machine learning applications across a wide range of fields of archaeological research, they mainly provide qualitative observations rather than a quantitative analysis as presented here, as is also the case for the overview published by Calder et al. (2022). Palacios (2023) provided an innovative introduction to the status of machine learning in archaeology, although mainly focused on Bayesian approaches, and with a limited selection of reviewed papers mainly on archaeological predictive models. A recent comparison of machine learning methods to PCA and LDA for species identification from faunal material found that methods like random forest and LDA outperform PCA (Cole et al. 2022).

The inherent importance of the fieldwork component that generates the data for machine learning approaches, as well as the variety of methods and practices existing in archaeology (Kelly and Thomas 2017; Renfrew and Bahn 2020), make it difficult to delimit the scope of the field. Statistical analyses of artefact morphology and their typological classification, the process of site prospection using satellite imagery, the quantitative analysis of animal bones at a human-created site, and the analysis of the soil and geological composition of a site are all part of archaeological research.

Moreover, machine learning refers to a large set of different statistical methods and algorithms, from those based on Bayesian statistics to evolutionary algorithms that seek to emulate the process of natural selection (Hastie et al. 2009; Kubat 2017). Recent developments in artificial neural networks have had an impact far beyond the limits of academic research (Pangti et al. 2021; Bachute and Subhedar 2021; Kawamleh 2024), not to mention the emergence of so-called generative artificial intelligence, which is trained to create text, images, audio, or even video based on simple text prompts (Zhang et al. 2016; Vaswani et al. 2017; Liu et al. 2023). Such advances in machine learning methods and their successful application to a specific question in one area have often led to experimentation in other (sometimes completely unrelated) areas. The breadth of methods and practices used in archaeology thus provides many opportunities for new advances in machine learning to be applied.

Therefore, if we wish to obtain a comprehensive view of how machine learning methods have impacted the field of archaeology, we must look broadly across the various machine learning architectures, as well as the many subfields and chronological focuses of archaeological research, from the survey phase to post-excavation analysis and interpretation. Furthermore, to determine the frequency of machine learning applications and whether the recent high-profile advances in artificial intelligence have increased interest in its use in archaeological research, we must undertake a quantitative and chronological analysis of the published literature.

With this review article, we aim to provide a comprehensive analysis of previous applications of machine learning methods in the field of archaeology. Our primary objective is to provide an overview of current trends and methodological pitfalls in the field. As secondary aims, we propose potential solutions to these issues and outline possible future directions for research. We analysed a corpus of academic publications across a set of categories, as comprehensive as possible, covering both machine learning method families and different archaeological subfields, presenting a chronological overview of machine learning applications across a wide spectrum of archaeological research, as well as a discussion of our analysis of the reviewed articles, the methods applied, the results obtained, and the conclusions provided.

2. Methods

A rapid systematic review was conducted (Petticrew and Roberts 2006; Grant and Booth 2009; Jesson et al. 2012; Peters et al. 2015; Page et al. 2021) to obtain a representative dataset for a broad overview of machine learning applications in archaeological research. Although there is no consensus on the specific methodology of a rapid review, we followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses 2020 guidelines (PRISMA, Page et al. 2021) and made our methodology as transparent as possible, as suggested by Haby et al. (2016).

2.1. Search strategy

The search strategy comprised two different and independent protocols: one performed for the automatic screening protocol (consisting of a fully automated and scripted workflow), consisting of multiple searches, and one performed for the manual screening protocol (mixing automated screening with results screening by a human), consisting of a single search (Table 1, Figure 1). The searches for the automatic protocol were performed by M.B. and consisted of twelve queries, each containing a combination of keywords (Box 1). These keywords, while not covering the entire scope of possible archaeological study cases using machine learning (such as those not using the exact term “machine learning”), nonetheless capture a broad overview of machine learning applications in archaeology. It is, however, important to note that this might have biased the search in favour of articles that explicitly use the term “machine learning”. On the other hand, an earlier search using the manual protocol involved multiple queries using the 10 machine learning method families we used for our data analysis (see supplementary material), but these returned a large number of duplicate results, most of which were already captured by simply using “machine learning”. The searches were performed in five online portals. All searches were conducted in English, except for the German National Library, where the search was conducted in German with only six of the queries (Box 1). These searches covered all records published before 1 January 2023. In total, 1460 records were retrieved from the automated search protocol, of which 558 were unique.

Table 1

Summary of results of both automatic and manual protocol searches on the six online portals. Note that the “Sum of unique totals” refers only to the sum of the number of non-duplicate results from each search; however, there were many publications present in both searches (see below), which would further reduce the sum total of unique items.

BIBLIOGRAPHICAL DATABASE | KEYWORD MATCHES
Automatic screening:
Web of Science | 969
PubMed | 413
Tübingen University Library | 51
German Archaeological Institute | 24
German National Library | 3
Total unique | 558
Manual screening:
Google Scholar | 300
Total unique | 285
Total | 1760
Sum of unique totals | 730
Figure 1

Review process from source selection to analysis. Inspired by the PRISMA 2020 flow diagram (Page et al. 2021). Reason 1 = Ineligible with automation tool; Reason 2 = Non-English record; Reason 3 = Full text not accessible; Reason 4 = Non-journal-based publications; Reason 5 = Absence of abstract; Reason 6 = Archaeology and machine learning keywords from the list not present in the text; Reason 7 = Archaeology and machine learning keywords from the list not present in the abstract or in the title; Reason 8 = Preliminary exclusion (i.e. no access to publication, publications by or with contributions from the current authors, entire books, non-academic reports, preprints, reviews or theoretical papers, potentially predatory journal); Reason 9 = Excluded based on the title; Reason 10 = Excluded based on the abstract; Reason 11 = Excluded based on the first reading of the full text; Reason 12 = Full text does not involve archaeological research; Reason 13 = Full text does not involve machine learning methods as defined in our protocol; Reason 14 = Conflicts of interest (publication by the authors of this review or in which the authors contributed); Reason 15 = Theory or review paper. Figure created using Microsoft Word and Inkscape.

Box 1 Search queries used for the automatic protocol search. The German versions of the queries used for the German National Library (DNB) portal are specified in parentheses where applicable (some queries differ only in the English spelling variant of “archaeology”, which is not relevant in German)

Query 1 = “archaeology machine learning” (“archäologie maschinelles lernen”)

Query 2 = “archeology machine learning”

Query 3 = “archaeological machine learning” (“archäologisch maschinelles lernen”)

Query 4 = “archeological machine learning”

Query 5 = “archaeology deep learning” (“archäologie deep learning”)

Query 6 = “archeology deep learning”

Query 7 = “archaeological deep learning” (“archäologisch deep lernen”)

Query 8 = “archeological deep learning”

Query 9 = “archaeology artificial intelligence” (“archäologie künstliche intelligenz”)

Query 10 = “archeology artificial intelligence”

Query 11 = “archaeological artificial intelligence” (“archäologisch künstliche intelligenz”)

Query 12 = “archeological artificial intelligence”

The manual protocol search extracted results from Google Scholar without any constraint on year of publication, but was performed on 6 January 2023 so that it effectively returned only results from 2022 and earlier. It was performed by J.D.O.F. and consisted of a single query of keyword combinations executed using a script written in Python 3.9.16 (Python Software Foundation 2022) with the habanero, pandas, and scholarly libraries (Chamberlain 2022; Pandas development team 2022; Cholewiak et al. 2022), along with geckodriver (Mozilla 2023). The script obtained the top 300 results as ranked by Google Scholar and automatically parsed their metadata (Orellana Figueroa 2020). This search query and its keywords included the term “machine learning” and various forms of the term “archaeology” (Box 2) and retrieved a total of 285 unique records.
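As an illustration only, a script along these lines could look like the following sketch, which uses the scholarly library to retrieve and tabulate the top-ranked results for the Box 2 query; the field names and output file are hypothetical, and this is not the authors' original script.

```python
# Illustrative sketch of a scripted Google Scholar retrieval (not the original script).
import pandas as pd
from scholarly import scholarly

# Query following Box 2
QUERY = '"machine learning" archaeological | archeological | archaeology | archeology | archaeo | archeo'

records = []
for rank, pub in enumerate(scholarly.search_pubs(QUERY), start=1):
    bib = pub.get("bib", {})        # bibliographic metadata of one result
    records.append({
        "rank": rank,
        "title": bib.get("title"),
        "year": bib.get("pub_year"),
        "venue": bib.get("venue"),
        "authors": bib.get("author"),
    })
    if rank >= 300:                 # keep the top 300 results as ranked
        break

pd.DataFrame(records).to_csv("scholar_results.csv", index=False)
```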

Box 2 Search query used for the manual protocol search

Topic = “machine learning” archaeological | archeological | archaeology | archeology | archaeo | archeo.

2.2. Screening

Following the automatic and manual protocol searches, two independent screening procedures were performed on the retrieved records (Figure 1). For the automatic protocol, after excluding publications not written in English and records not published in scientific journals, the screening was performed with a script written in R 4.4.1 (R Core Team et al. 2024) with the corpus and tm packages (Huang et al. 2021; Feinerer and Hornik 2023), divided into two filters. The first screening step retained all documents whose full text contained one of the archaeology or machine learning keywords we defined for this filtering step (supplementary material). The second screening step was performed in the same manner with the same keywords, but using only the abstract and title of those results of the first screening step that were found to contain the keywords in the full text. A total of 135 records were included from the automatic search and screening protocol.
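The screening script itself was written in R with the corpus and tm packages; the following Python sketch only illustrates the logic of the two-step keyword filter, with hypothetical keyword lists and record fields (the exact keyword lists and matching rule are given in the supplementary material).

```python
# Python analogue of the two-step keyword screening (the original was written in R
# with the corpus and tm packages); keyword lists and record fields are illustrative.
ARCHAEOLOGY_KEYWORDS = {"archaeology", "archaeological", "archeology"}  # hypothetical subset
ML_KEYWORDS = {"machine learning", "deep learning", "neural network"}   # hypothetical subset

def matches(text, keywords):
    """True if any of the keywords occurs in the (lower-cased) text."""
    text = text.lower()
    return any(keyword in text for keyword in keywords)

def screen(records):
    # Step 1: keep records whose full text contains one of the defined keywords
    step1 = [r for r in records
             if matches(r["full_text"], ARCHAEOLOGY_KEYWORDS | ML_KEYWORDS)]
    # Step 2: of those, keep records whose title or abstract also contains the keywords
    step2 = [r for r in step1
             if matches(r["title"] + " " + r["abstract"], ARCHAEOLOGY_KEYWORDS | ML_KEYWORDS)]
    return step2
```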

The manual screening protocol consisted of additional steps. First, a preliminary exclusion was performed, which checked whether the publications were inaccessible, whether they were preprints not yet in print, and whether the record was for an entire book, amongst other criteria (Figure 1, supplementary file). Further screening was then performed based on the exclusion and inclusion criteria (supplementary file), firstly by examining the publication title alone. If, after this, an article could not be securely excluded or included, a second screening step followed, now examining the abstract as well. If this was still insufficient for a secure inclusion or exclusion, the full text was examined. A total of 93 records were included from the manual search and screening protocol.

Both screening strategies used the same inclusion and exclusion criteria (Figure 1, supplementary file), though the automatic screening script, by necessity, used a simplified, keyword-based version, while the manual protocol screening relied on human reading of the text.

However, to further verify the level of agreement between the two different screening strategies, the same automatic screening was performed on the 285 collected records from the manual protocol search. The results of this screening are reported below (cf. 3.1).

2.3. Data extraction

We systematically divided each of the included articles into one or more study cases if different applications (e.g. using different data or seeking different goals) were attested in the publication. We therefore obtained a total number of study cases (n = 147) greater than the total number of included articles (n = 135). From all these study cases, we systematically recorded a total of nine features, each composed of different categories (Table 2). These features were based on the authors’ ad hoc evaluation of possible features of interest in the evaluation of the study cases. No inter-rater reliability was calculated for the selection of these features. However, only four of these proved helpful for gaining a clear overview of the practice of machine learning in archaeology: the architecture of the machine learning model used as well as its evaluation process (see below), the archaeological subfields where these methods were applied, and the type of task they were applied to (Figure 2). Though important, the other groups of features did not allow us to identify common patterns in the application of machine learning in archaeology. Thus, we included them only sporadically in our discussion. Furthermore, we extracted metadata related to the publication for our quantitative analysis. These metadata included the year of publication, the list of authors, the country of affiliation of the first author, the name of the journal, and whether the article was published through an open-access modality.

Table 2

The nine features collected systematically from the review.

FEATURE | NUMBER OF CATEGORIES
Model | 70
Best model | 17
Family | 9
Subfield | 15
Input data | 11
Evaluation | 3
Task | 19
Result | 5
Pre-training | 4
Figure 2

The four fields of information recorded in the review that present significant characteristics to explain variation in machine learning applications in archaeology, and their related classes/categories. One study case might have been attributed to several subfields or architecture categories. Figure generated with R 4.2.2 (code available in supplementary material 3) and additional editing with Inkscape.

We divided machine learning methods into nine broad categories according to the family of algorithms they belong to (Eleftheriadou et al. 2025, Figure 2, supplementary file). We granularly recorded the different types of models used, which were then classified into the relevant architecture category, and we also recorded the frequency of their use. Each model was classified into only one architecture category, even in cases where multiple categories may have been warranted. For example, random forest is an ensemble of decision trees, fitting into both “ensemble learning” and “decision trees”, but was classified into “ensemble learning” only, as that was the more relevant aspect of the algorithm.

Furthermore, according to the goals of the model application, we grouped the evaluation into three broad categories: classification, regression, and clustering (Figure 2; Alpaydin 2014, pp. 5–13).

Due to the wide range of subjects, we divided the field of archaeology into fifteen subfields (Figure 2) based on Kelly and Thomas (2017). The assignment of a study case to a specific subfield was based on the authors’ own evaluation (see supplementary material). No inter-rater reliability calculation was performed for the assignment. A single publication could be classified into several subfields based on its research questions or objectives; therefore, the total number of entries in the subfield feature does not equal the total number of study cases included in the review.

Finally, we classified every article’s application into nineteen a posteriori categories based on the broader task (e.g. automatic structure detection) for which the machine learning methods were applied (Figure 2). These task categories were ad hoc and based on the authors’ evaluation of the corpus. Initially, we sought to define the tasks for which machine learning was applied in each publication at a very granular level, then aggregated the more granular task categories into broader ones a total of five times until we arrived at the nineteen categories reported here (see supplementary material). No inter-rater reliability calculation was performed for the creation of these task categories. These task categories were one of the most important characteristics to analyse, as they helped us understand not only the goals of the authors, but also which models were better suited for the desired outcomes, based especially on previous applications.

3. Results

3.1. Screening results

Since the two screening protocols were performed independently, we will report them as such, even if the records obtained after both screenings were merged into a single set of publications (Figure 1). Out of the 558 unique records obtained from the automatic protocol, four could not be screened with our script (0.72%, reason 1), another four were excluded, as they were not written in English (0.72%, reason 2), and an additional six were also excluded as the full text was inaccessible (1.08%, reason 3). Furthermore, 102 records were excluded, as they were not academic periodical journal articles (18.28%, reason 4).

From the remaining 442 records, an additional four were excluded, as they did not have any abstract available (0.72%, reason 5). Furthermore, a screening based on the presence or absence of keywords (supplementary file) in the full-text was performed, and from this we excluded 123 records that did not include any of our defined keywords (27.83%, reason 6).

After the previous screening step, we performed a second filter on the 315 records which were left, based on the same set of keywords as previously but performed on the title and abstract only. A total of 170 (30.05%, reason 7) additional records were then excluded, obtaining a total of 145 records for the final list of included publications from our automatic screening.

With the 285 unique records from the manual query, we performed a preliminary exclusion and removed 78 records (27.37%, reason 8). With the remaining 207 articles, we performed our title-based screening and excluded 50 records (17.54%, reason 9), as well as our abstract-based screening, excluding an additional 11 records (3.86%, reason 10). The final content-based screening, based on the full-text of the publications, led to the exclusion of 17 records (5.96%, reason 11).

After these screening steps we were left with 129 records. We then removed any articles that were not published in academic periodical journals, leading to the exclusion of 37 records (12.98%, reason 4). We thus obtained the final list of included publications from our manual screening, containing a total of 92 items.

To verify the level of agreement between the two protocols, we performed the automatic screening protocol on the 285 unique records found through our manual protocol search. Of the 92 records included through the manual screening, 39 records were also included through automatic screening of the manual protocol search results, while 54 were included through manual screening but not through the automatic screening protocol. In addition, a further 25 records that were not included through manual screening were flagged positive through the automatic screening protocol, though these 25 were not part of the dataset of reviewed articles.

In total, we obtained 196 unique items when the results from the manual and automatic screening were merged. Forty-two articles were included through both screening processes, while 51 articles were unique to the manual protocol, and 103 were unique to the automatic one. Additional records were excluded a posteriori during the reviewing of the articles for a wide variety of reasons (Figure 1, reasons 12, 13, 14 and 15), though this was mostly relevant for articles that were screened automatically, since the automatic screening script was not exact. In total, 61 records (31.12% of 196) were excluded. The articles that were removed to avoid reviewing articles the authors contributed to (Figure 1, reason 14) were McPherron et al. (2022) and Orellana Figueroa et al. (2021). The final list of articles reviewed contained a total of 135 records.

Finally, to perform a simple diachronic analysis of the number of publications from recent years, we performed the same automatic search and screening protocol, but searching only for articles published between 1 January 2023 and 30 September 2024. The results are described below.

3.2. Metadata analysis

Through a chronological analysis of the reviewed articles (Figure 3), we could distinguish a visible increase in the number of publications in recent years, with over 80% of the articles reviewed published after 2018. From our quick search for publications between January 2023 and September 2024, which lay outside our inclusion criteria and which we therefore did not review, we obtained a total of 278 unique records. This number is slightly higher (by 12%) than the number of records from 2021 to 2022 in the results of our automatic protocol search (n = 248). In comparison, the number of articles published in 2021 and 2022 found through the automatic protocol search showed an increase of 82% compared to those published in 2019 and 2020 (n = 136). A total of 76 records from the 278 obtained from our 2023 and 2024 search passed the automatic screening protocol. Although publications from the last trimester of 2024 were not included, our results suggest a stabilisation in publication numbers, in contrast to the large increase after 2020.

Figure 3

Number of publications per year between 1997 and 2022; articles published after 2018 (light blue) account for more than 80% of the publications. The dashed line represents publications from 1 January 2023 to 30 September 2024. Figure generated with R 4.2.2 (code available in supplementary material 3) with additional editing in Adobe Illustrator.

Focusing now only on the 135 publications reviewed, we observed that 10 journals, out of a total of 68 different journals, had published nearly half of them (67 records, 49.63%). Out of these, 6 journals had published nearly a quarter of all articles reviewed (23.70%, Table 3). The journal Remote Sensing had the most publications, followed by the Journal of Archaeological Science. However, multiplying each journal’s impact factor and journal-wide h-index by its number of publications yields two rudimentary research impact metrics across all the publications reviewed. From this analysis, PLOS One obtained the highest combined score for the h-index-based metric and the second highest for the impact factor-based metric (IF). An analysis of the publication policies revealed that a total of 95 articles (70.37%) were published with an open access modality (including gold, silver, or bronze) or were otherwise freely accessible. Some bias towards larger journals or more open access articles may have occurred due to our search strategy (which, for example, did not make use of Scopus) or search queries (see Boxes 1 and 2). Additional bias could have come from a lower interest from some journals in publishing machine learning applications compared to more well-established approaches.

Table 3

The ten most represented journals with their h-index and impact factor (IF) scores and the corresponding totals obtained by multiplying each score by the number of articles (n = 135 articles in total). Metrics were consulted on 14/07/2024 on each journal’s website for the impact factor and on SJR for the h-index (supplementary file).

JOURNAL | NUM. OF ARTICLES | H-INDEX | N ⋅ H-INDEX | IF | N ⋅ IF
Remote Sensing | 15 | 193 | 2895 | 4.2 | 63
Journal of Archaeological Science | 14 | 152 | 2128 | 2.6 | 36.4
PLOS One | 11 | 435 | 4785 | 3.75 | 41.25
Scientific Reports | 6 | 315 | 1890 | 3.8 | 22.8
Journal of Computer Applications in Archaeology | 5 | 15 | 75 | N/A | N/A
Archaeological Prospection | 4 | 46 | 184 | 2.1 | 8.4
Journal on Computing and Cultural Heritage | 3 | 35 | 105 | 2.7 | 8.1
Archaeological and Anthropological Sciences | 3 | 42 | 126 | 2.14 | 6.42
Palaeogeography Palaeoclimatology Palaeoecology | 3 | 177 | 531 | 2.6 | 7.8
Virtual Archaeology Review | 3 | 17 | 51 | 1.6 | 4.8
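As an aside, the combined scores in Table 3 are simple products of article counts and journal metrics; a minimal sketch of that computation (with a few rows transcribed from Table 3; column names are illustrative, not from the authors' analysis scripts):

```python
# Sketch: rudimentary impact metrics from Table 3 (article count times h-index / IF).
import pandas as pd

journals = pd.DataFrame({
    "journal": ["Remote Sensing", "Journal of Archaeological Science", "PLOS One"],
    "n_articles": [15, 14, 11],
    "h_index": [193, 152, 435],
    "impact_factor": [4.2, 2.6, 3.75],
})

journals["n_h_index"] = journals["n_articles"] * journals["h_index"]
journals["n_if"] = journals["n_articles"] * journals["impact_factor"]
print(journals.sort_values("n_h_index", ascending=False))
```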

The geographical distribution of the main institutions of the first authors of our set of reviewed articles shows that European and Anglosphere countries, sometimes referred to as the “Global North”, are over-represented (Figure 4), as already observed by Davis (2020b). Although one could also draw a line between the northern and southern hemispheres to divide the geographical distribution of the analysed publications, this division is not as clear, as countries such as New Zealand and Australia, and to some extent Argentina (part of the “Global South”), challenge this notion.

Figure 4

Number of articles published per country based on the country of the first author’s affiliation. Figure generated with R 4.2.2 (code available in supplementary material 3).

3.3. Review findings

3.3.1. Statistical overview

Among the different subfields of archaeology present in our review, surveying (and site prospection), conservation and cataloguing, as well as classification and typology, are the most represented in our dataset and together account for 49% of all study cases (Figure 5A). Landscape archaeology and geoarchaeology are also well represented with 20 and 17 attributed study cases respectively. The oldest applications (1997 – 2016) of machine learning in archaeology are linked with the subfields of classification and typology as well as conservation and cataloguing. In recent years, there has been a general increase in all topics with no clear trend visible, except for the subfields of surveying as well as conservation and cataloguing, which have considerably increased since 2021 (Figure 5A). Only the subfields of zooarchaeology, archaeobotany, and archaeological excavation remain nearly constant through time.

Figure 5

(A) Number of articles from each archaeological subfield between 1997 and 2022. (B) Number of articles from each architecture class between 1997 and 2022. Empty bars represent a count of 1. Figure generated with R 4.2.2 (code available in supplementary material 3).

Examining the different architectures of machine learning methods used, we observed that artificial neural networks (ANNs) and ensemble learning represent 62% of all study cases, with 110 and 70 occurrences respectively (Figure 2). Linear classifiers, decision trees, rule induction, and nearest neighbour classifiers were also well represented, accounting for 25% of total applications. An increase in the use of ANNs in our list of reviewed articles could be observed from 2021 onwards, with the number of applications quadrupling compared to the years 2019 and 2020 (Figure 5B). At a more granular scale, we found a total of 70 distinct machine learning models (Figure 6, Annexe). In addition, we found a particularly high diversity in the unsupervised learning and clustering category of model families, with a unique model for each of the twelve recorded applications in the category (Figure 6, Annexe). Despite the dominance of ANNs (Figure 6), random forest was the most common individual model (n = 54). Across all articles reviewed, the mean number of models used per study case was 2.12, with a high disparity across publications: some applied only one model, while articles such as Bataille et al. (2018) and Courtenay et al. (2019) tested six different models at once. Classification was the most common form of data evaluation, applied in more than 75% of study cases (n = 112). On the other hand, only 19% (n = 29) were regression models, and clustering was only represented with six uses (6%, Figures 2, 8 and 9).

Figure 6

Tree map of the different models seen in our corpus as well as the family of models they belonged to in our categorisation. Figure generated with R 4.2.2 (code available in supplementary material 3).

It was more difficult to observe patterns for the other features we collected, as the attribution of an article to one or several categories depended more on the judgement of the person reviewing. Two of the a posteriori study tasks are highly prominent: automatic structure detection and artefact classification. These two tasks account for 45% (n = 67) of all study case tasks (Figure 2). Tasks such as taphonomic classification, archaeological predictive models, and architectural element classification are well represented with 5 to 13 study cases each, whilst nine task categories (e.g. sourcing, species classification, movement recognition) were only relevant for around 5 study cases.

For the different types of input data observed in all study cases we reviewed, remote sensing images were the most represented, with around 40% (n = 58) of the total. Four other input types (small-scale images, artefact measurements, spectra, and 3D models) were also well represented (Figure 7A), being present in between 6 and 16% of all study cases reviewed. The remaining nine input types were poorly represented, being present in only 14% of all study cases.

Figure 7

(A) The five most represented classes of input data among the reviewed papers, n = 148. (B) Results of the reviewed papers according to the authors or the presented results, n = 147. Figure generated with R 4.2.2 (code available in supplementary material 3).

Finally, according to the assessment of article authors or (if not discussed) the results metrics, we could observe that most applications (79) had reported “successful” outcomes (Figure 7B), with an additional 40 mixed results (27.2%). Only 15 applications (10.2%) were reported to have been “unsuccessful”, whilst we noted 9 (6.1%) study cases whose application had important methodological issues and classified them as such even if the authors reported a successful outcome. Finally, four (2.7%) study cases did not have a defined outcome. These results, however, could be affected by the well-reported issue of publication bias against the reporting of negative results (Song et al. 2013), as well as the related file-drawer effect (Rosenthal 1979; Sparks 2009), which can strongly affect the perception and production of scientific results (Song et al. 2010).

3.3.2. Aggregated information

Through the aggregation of the data annotated for the study tasks, method families, archaeological subfields, and input types, we obtained an overview of the relationship between the objectives of the study cases and the means used to achieve them.

Correlations between tasks and model families are highly heterogeneous (Figure 8). While taphonomic classification shows its highest correspondence rate, 26%, with unsupervised learning and clustering methods, the correspondence between artefact classification and ANNs is much higher, at 49%. Even stronger correlation rates could be observed between archaeological predictive models and ensemble learning methods, at 66%, and between automatic structure detection and ANNs, at 60%.
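These correspondence rates amount to row-normalised cross-tabulations of the annotated features; the following minimal sketch shows the kind of computation involved, with a toy long-format table of study-case entries (the column names are illustrative, not the actual annotation schema).

```python
# Sketch: correspondence rates between tasks and model families as a
# row-normalised cross-tabulation (toy data; column names are illustrative).
import pandas as pd

entries = pd.DataFrame({
    "task":   ["automatic structure detection", "automatic structure detection",
               "artefact classification", "taphonomic classification"],
    "family": ["ANN", "ensemble learning", "ANN", "unsupervised learning"],
})

# Share of each model family within each task (each row sums to 1)
rates = pd.crosstab(entries["task"], entries["family"], normalize="index")
print(rates.round(2))
```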

Figure 8

Alluvial diagram of the different tasks in the analysed studies on the left, the related architecture of machine learning models on the right and the evaluation process in the background. Tasks and architectures poorly represented (n < 5) have been classified as “others”. A study might have applied numerous models, or its research objectives could be classified into more than one task. In such cases, we created multiple entries for each paper where applicable (see supplementary material). Figure generated with R 4.2.2 (code available in supplementary material 3).

The correlations between archaeological subfields and study tasks are lower than the correlations visible for architectures (Figure 9). Articles dealing with environmental reconstruction had their highest correlation, at only 33%, with the subfield of geoarchaeology. Studies addressing artefact classification have a higher 45% correspondence rate with the subfield of classification and typology. The highest correlation rate, however, is present between the task of automatic structure detection and the subfield of surveying, at 80%.

Figure 9

Alluvial diagram of the different tasks in the analysed studies on the left, the related archaeological subfields on the right with the evaluation process in the background. Tasks and subfields poorly represented (n < 5) have been classified as “others”. A study might have been attributed to several subfields, or its research objectives could be classified into more than one task. In such cases, we created multiple entries for each paper where applicable (see supplementary material). Figure generated with R 4.2.2 (code available in supplementary material 3).

Finally, and perhaps expectedly, several of the most represented tasks had the largest number of study cases with reported unsuccessful or mixed results (Figure 10). Of the studies dealing with automatic structure detection (25.85% of the total) and artefact classification (19.73% of the total), 60% and 33% respectively reported mixed or unsuccessful results. Other, less represented tasks have higher success rates. Articles dealing with artefact prediction are successful in 80% of cases, whilst articles dealing with environmental reconstruction all report successful results.

Figure 10

Alluvial diagram of the different tasks in the analysed studies on the left, the related results on the right with the evaluation process in the background. Tasks poorly represented (n < 5) have been classified as “others”. A study might have its research objectives classified into more than one task. In such cases, we created multiple entries for each paper where applicable (see supplementary material). Figure generated with R 4.2.2 (code available in supplementary material 3).

3.3.3. Classification of artefacts and animal remains

Whilst classification tasks were mainly performed on ceramic material (31%), as in Hörr et al. (2014) or Anichini et al. (2021), a wide range of different materials were used in our dataset of reviewed articles. Studies on coins (Boon et al. 2009), stone tools (MacLeod 2018; Pargeter et al. 2019; Emmitt et al. 2022; as well as Orellana Figueroa et al. 2021, excluded from our review to avoid possible conflicts of interest, though still relevant), archaeobotanical seed remains (Landa et al. 2021), phytoliths (Berganzo-Besga et al. 2022), and ivory figurines (Gansell et al. 2014) were also observed. Moreover, 17% of study cases were conducted on cave art images, such as in Kogou et al. (2020) or Horn et al. (2022b, a). Of all classification study cases, 17% (n = 5) dealt with the bioarchaeological classification of human or animal remains. Due to the variety of materials studied, the input data for models were very diverse. Small-scale images accounted for 48% (n = 14) of all input data for the classification task and metric data for 27% (n = 8), whilst less frequent input data types, such as multilayer images of artefacts (e.g. depth maps, 3D model layers), spectra, and remote sensing images, together accounted for 23% (n = 7).

In our corpus, two machine learning method families emerge as the foremost choices for artefact classification: ANNs represent 50% of the models used, followed by ensemble learning, accounting for 31% of the models (Figure 8). Bayesian classifiers, as well as decision trees and rule induction, are also present, but represent less than 15% of the model architectures for the artefact and animal remains classification task.

3.3.4. Archaeological predictive models (APMs)

Ten studies in our review focused on archaeological predictive models (APMs), dealing with either human landscape occupation patterns or human-environment interactions. In 77% (n = 7) of these study cases, the input data were large-scale raster images derived from a diverse range of natural covariates or remote sensing imagery.

The most common method family applied was ensemble learning, representing 62.5% (n = 10) of the model entries for this task. One model applied exclusively to this task was Maximum Entropy (MaxEnt) (Benner et al. 2019; Yaworsky et al. 2020). Regarding the results from APMs, 60% (n = 6) were classified as successful, with the remaining 40% (n = 4) classified as partially successful or unsuccessful, such as in Hansen and Nebel (2020) or Miera et al. (2022).

3.3.5. Automatic structure detection

Representing 25% (n = 38) of all study cases, automatic structure detection, also known as geographical object-based image analysis (GEOBIA), is the most prominent task category in our review. Input data came from a wide range of remote sensing images, mainly LiDAR or airborne laser scanning (ALS), used in 50% of all study cases in this task, or various satellite image collections (23%), such as in Menze et al. (2006) or Orengo et al. (2020). The remaining 27% of input data were either UAV ortho-images, such as in Orengo and Garcia-Molsosa (2019), Monna et al. (2020), Agapiou et al. (2021), Altaweel et al. (2022), and Fisher et al. (2022), historical maps, as in Garcia-Molsosa et al. (2021), and, since 2020, ground-penetrating radar (GPR) or even bathymetric data (Febriawan et al. 2020; Bordon et al. 2021).

ANNs were the most prominent family of methods for automatic structure detection, accounting for 62% of all models applied for this task. Different ANN models were used, the most popular being mask region-based convolutional neural networks (MR-CNN), which accounted for 13% of all models applied for automatic structure detection, followed by faster region-based convolutional neural networks (FR-CNN) and U-Net, which accounted for 8% and 7% of the models used, respectively. However, the single most popular model overall was random forest, which accounted for 16% of all models used for this task. The results of applications in this task were mainly reported as unsuccessful or partially successful (59%, n = 31).
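As an illustration of the kind of model most frequently applied here, the sketch below loads a pretrained Mask R-CNN with torchvision and runs it on a dummy image tensor; the generic COCO weights are a placeholder, and a real structure detection application would require fine-tuning on annotated LiDAR/ALS-derived or satellite imagery (none of this reproduces a specific reviewed study).

```python
# Sketch: instance segmentation with a pretrained Mask R-CNN (torchvision).
# Generic COCO weights only; detecting archaeological structures would require
# fine-tuning on annotated remote sensing tiles (e.g. LiDAR-derived rasters).
import torch
import torchvision

model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

# Dummy 3-channel tensor standing in for one remote sensing tile
image = torch.rand(3, 512, 512)

with torch.no_grad():
    prediction = model([image])[0]   # dict with 'boxes', 'labels', 'scores', 'masks'

# Keep only confident detections
keep = prediction["scores"] > 0.5
print(prediction["boxes"][keep].shape, prediction["masks"][keep].shape)
```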

3.3.6. Digital heritage

Articles focused on this task were mainly represented by those dealing with the classification and reconstruction of architectural elements, with a total of twelve study cases (8% of all 147 study cases reviewed). The models in this task category mainly used small-scale images from architectural photos as input (58%, n = 7), but also point clouds and 3D models in 33% of the reviewed study cases, such as in Grilli and Remondino (2019) or in Matrone and Martini (2021).

The models applied in this task are roughly evenly divided between two families: ensemble learning, with 35% of all study cases, and decision trees and rule induction models, accounting for 25% of the total models applied. Machine learning applications for this task were generally negatively assessed in comparison to applications for other tasks, with only 40% of study cases reporting successful results.

3.3.7. Text analysis

We counted five study cases that applied machine learning methods to text analysis tasks (3% of all 147 study cases), and all except one (Dhivya and Devi 2021) used text as input data for their model. Almost all applied models were based on ANNs (83%, n = 5), such as in Dhivya and Devi (2021) and Brandsen and Lippok (2021); only Boon et al. (2009) used a memory-based learning (MBL) model, part of the category of unsupervised learning and clustering algorithms. The results of the applications for this task were overwhelmingly negatively assessed, with four out of the five studies reporting mixed or unsuccessful results.

3.3.8. Taphonomic classification

We reviewed a total of twelve study cases categorised as dealing with the classification of taphonomic features on archaeological finds (8% of all study cases). All of these studies dealt with osseous material, and all but two of them dealt with bone surface modifications (BSMs), with the remaining two dealing with bone breakage. Bone surface modifications are marks left on the surface of bones by agents such as hominin butchery (e.g. using stone tools), animal carnivory, or trampling. The goal of all the articles under discussion was to uncover which agent caused the specific taphonomic process (BSM or breakage). Input data were primarily metric and categorical variables of bone marks or breaks in six cases (50%), or microscope images in three cases (25%), with the remaining three (25%) using PCA scores from the analysis of geometric morphometric landmarks.

Many of the articles reviewed applied a large number of machine learning methods as well as methods that we do not consider machine learning (such as PLSDA and MDA; e.g. Domínguez-Rodrigo 2018; Abellán et al. 2022) to compare their performance on the specific task, whilst others (e.g. Byeon et al. 2019; Cifuentes-Alcobendas and Domínguez-Rodrigo 2019) were more restrained. This meant that there was no clear machine learning method or family of methods that was used much more frequently than others, although neural networks were applied in all but one of the articles reviewed (see “Results”), which is not the case for other methods also commonly used in these articles, such as SVMs or random forests.

ANNs were applied in all the reviewed study cases except for one (Aramendi et al. 2019) and represent 28.3% (n = 15) of all models applied for taphonomic classification. Ensemble learning, linear classifiers, decision trees and rule induction, nearest neighbour classifiers, and Bayesian classifiers each represented between 11% and 17% of the referenced architectures for this task category.

However, many of the articles found were deemed through our reviewing process to have had important methodological issues, especially due to under-reporting or under-detailing of the specific methodology used (see Figure 10; see also 4.2.6).

4. Discussion

4.1. General considerations of the corpus

A first observation is the comparatively late interest in machine learning applications in the field of archaeology compared to other sciences (Dramsch 2020; Padarian et al. 2020; Shehab et al. 2022; Wang et al. 2022). Indeed, the exponential increase in publications related to machine learning applications in archaeology dates from the period between 2018 and 2019 onwards (Figure 3). This late development could arguably be attributed to a tradition of suspicion in archaeology (Yoffee and Fowles 2010). Researchers sometimes consider archaeological data to be scattered, complex (Smith and Peregrine 2011, p. 7), and unable to be explained by models that are not fully human-made. The uncertainty in selecting the best model to analyse the complexity of past human action is not recent (Doran 1990), but it has been exacerbated by machine learning and even more by deep learning models, which are often opaque “black boxes” in their internal logic (Ramazzotti 2020; Fiorucci et al. 2020; Bickler 2021; Calder et al. 2022). Furthermore, it appears that some sub-disciplines of archaeology had, as of 2022, yet to exploit the new developments in machine learning to the same degree as other sub-disciplines. Following our classification of archaeological subfields (see supplementary material), ethnoarchaeology, archaeobotany, and cognitive archaeology are under-represented in our corpus (Figures 2 and 5A). Although the reasons for such an under-representation remain unclear, it is likely that many research questions in these subfields are less suitable for machine learning applications at this time, or at least to a lesser degree than those of other sub-disciplines (Marom 2025).

Regarding publication policies across our list of reviewed articles, the high representation of open-access papers in a field where the traditional closed-access model of publication is widespread (see below; Marwick and Birch 2018) could be a signal of a paradigm change in which scholars seek to share their knowledge more broadly and fairly (Marwick et al. 2017; Nicholson et al. 2023). While many journals that deal specifically with computer science applications in archaeology (e.g. Archeologia e Calcolatori, Internet Archaeology, or the Journal of Computer Applications in Archaeology) only have open access publication routes, more classical and broader journals like the Journal of Archaeological Research, Journal of Archaeological Science or Antiquity mainly provide traditional non-open access papers, with open access papers comprising only (respectively) 10%, less than 5%, and less than 2% of all articles as of 1 September 2024. These results are unlike observations made for articles dealing with machine learning applications in other disciplines such as soil science (Padarian et al. 2020), where paid-access routes vastly dominate. However, researchers have to keep in mind the existence of non-open access journal articles and book chapters to avoid a FUTON (full text on the net) bias (Wentz 2002), which would otherwise lead to part of the wider research results in the field being overlooked. This points to a wider adoption of both FAIR (Findable, Accessible, Interoperable, Reusable) data principles (Wilkinson et al. 2016; Nicholson et al. 2023) and CARE (Collective benefit, Authority to control, Responsibility, Ethics) principles (Carroll et al. 2020; Gupta et al. 2023). Nonetheless, while the degree of open access of machine learning applications in archaeology seems promising, the large disparity born of colonialism unfortunately remains wide (Davis 2020b). The difference in the number of publications from “Global North” and “Global South” countries can be directly linked with the gap in economic development between higher- and lower-income countries, leading to the under-development of science and technology (Sagasti 1973; Allik et al. 2020). Increasing collaborations between different countries and institutions could help close this gap (Sonnenwald 2007).

4.2. How is machine learning used across archaeology?

In the earlier stages of machine learning applications in archaeology, many archaeological questions could only be answered by seeking experts in informatics, mathematics, and physics (Menze et al. 2006; Boon et al. 2009; Menze and Ur 2012; Hörr et al. 2014). However, a noticeable shift has occurred over time. Archaeologists have progressively embraced programming their own tools (using e.g. Python, R, Matlab, Java), and have also turned to specialised software solutions, as in Grilli and Remondino (2019), Garcia-Molsosa et al. (2021) or Casini et al. (2023). Nowadays, in many studies, archaeologists are both asking and answering the questions themselves (Orengo and Garcia-Molsosa 2019; Orengo et al. 2020; Yaworsky et al. 2020; Carter et al. 2021). The CAA Special Interest Group on Scientific Scripting Languages in Archaeology (SIG-SSLA)1 is one example of this versatility, with archaeologists developing scripting or machine learning tools for their own problems. In certain instances, computer science specialists independently pose questions and generate tailored answers for specific applications (Toler-Franklin et al. 2010; Ushizima et al. 2020; Wunderlich et al. 2022). However, there is always the question of the trade-off between the time spent developing new skills (Aldenderfer 1998, pp. 109–112) and the increasing complexity of machine learning methods, if not in programming then in understanding, which makes it more and more difficult for an archaeologist to master all the abilities needed to develop an entire workflow for a computer-based approach.

The understanding between the one who poses the question and the one who answers it is crucial, as the nature of the answers obtained depends intricately on the questions asked. Therefore, archaeologists should be careful to verify the suitability and validity of the applied methods, as only they can understand the full complexity of what happens outside these “black boxes”. Orton once stated that: “… the mathematician’s job is not to tell the archaeologist what to do – the archaeologist maintains his responsibility for what he does – but to help him decide how to do it.” (Orton 1980, pp. 15–16). This remains applicable to this day to other specialists, such as the computer scientists especially relevant here. However, while defining a single, universally applicable methodology for archaeological applications of machine learning methods (already a very large and diverse group) or other quantitative methods remains challenging (e.g. the “Statistical Cycle”, Orton 1980, p. 20, figure 1.3), a pragmatic approach can highlight several best practices depending on the desired task (Figure 11).

Figure 11

Workflow adapted to machine learning solutions applied to archaeological problematics. Figure generated with Microsoft Word and Inkscape.

4.2.1. Classification of artefacts and animal remains

Classifying artefacts and animal (including human) remains into categories is a central aspect of archaeological work (Read 2018). Archaeologists have looked for solutions in computer science since the second half of the twentieth century (cf. 1.; Binford and Binford 1966; Orton 1980, pp. 156–178). It is therefore not surprising to see machine learning applications developed for this purpose already very early (Figure 5A). Earlier papers generally applied methods such as decision trees, ensemble learning, Bayesian classifiers, as well as linear classifiers combined with nearest neighbour classifiers (Nguifo et al. 1997; Hörr et al. 2014; Mircea et al. 2015a; MacLeod 2018; Canul-Ku et al. 2019; Pargeter et al. 2019; González-Molina et al. 2020). Since 2020, however, ANNs have gained more traction (e.g. Ushizima et al. 2020; Graham et al. 2020), and now represent the most used family of methods applied for this task (Figures 6, 8; Ling et al. 2024). ANNs, especially CNNs, are standard for computer vision tasks like object classification (Guo et al. 2016; Chai et al. 2021). Dimensionality reduction methods (e.g. PCA and MDA) are often performed alongside machine learning algorithms and can be an effective way to improve model accuracy when applied correctly (e.g. Muzzall 2021). However, their ability to distinguish categories of artefacts seems limited compared to machine learning methods such as random forest (Cole et al. 2022).
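A minimal sketch of that kind of comparison, pairing PCA-based dimensionality reduction with a random forest classifier in a single pipeline and contrasting it with the same classifier on the raw features (synthetic data and parameters are illustrative only, not drawn from any reviewed study):

```python
# Sketch: PCA as a pre-processing step for a random forest classifier,
# compared with the same classifier on raw features. Synthetic data only.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=400, n_features=20, n_informative=8, random_state=0)

rf_raw = RandomForestClassifier(n_estimators=300, random_state=0)
rf_pca = make_pipeline(PCA(n_components=5),
                       RandomForestClassifier(n_estimators=300, random_state=0))

print("RF on raw features:", round(cross_val_score(rf_raw, X, y, cv=5).mean(), 3))
print("RF on PCA scores:  ", round(cross_val_score(rf_pca, X, y, cv=5).mean(), 3))
```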

Concerns about the validity of certain forms of typological classification as a practice, both human-made and computer-based, have previously been raised, especially by North American researchers (Adams and Adams 1991; Hörr et al. 2014). Other researchers, however, do not hesitate to use computational approaches for classification (Pargeter et al. 2019), and even create new classification methodologies with the help of machine learning. Notably, Hörr and others, seeking a less subjective approach to typological classification, developed a three-layer method for this purpose (Hörr 2011, pp. 136–137; Hörr et al. 2014), successively incorporating unsupervised, semi-supervised, and supervised classification algorithms within a framework Fayyad et al. (1996) termed the Knowledge Discovery Process. This approach is promising for the analysis of archaeological material and could prove useful for many archaeological studies: it invites critical questions on the artificiality of typological classifications and opens new avenues for future studies, such as those dealing with ceramic material or stone tools. We can also mention Martín-Perea et al. (2020), who used a method similar to that of Hörr et al. (2014), with a three-level pipeline to detect fossiliferous levels of an archaeological or palaeontological site.

Assessing and comparing the outcomes of these diverse classifications is challenging, given the inherent variations in datasets, modelling techniques, and evaluation metrics employed across different studies. Nonetheless, various opinions have already emerged on the issue. Emmitt et al. (2022) and Jalandoni et al. (2022) are optimistic about using machine learning for the classification of archaeological material; to these we could add Hörr et al. (2014) or Bickler (2018), who are more cautious but remain hopeful for the future of the practice. Others are less enthusiastic, however, and point to mixed results (Demján et al. 2022b; Lyons et al. 2022). Overall, it seems likely that machine learning applications for the classification of archaeological material will spread not only within the scientific community but also among professionals of archaeology, conservation institutions, and a wider public. The ArchAIDE project is already a noteworthy example of an open-access machine learning tool for archaeology intended to serve the public at large (Gualandi et al. 2021; Anichini et al. 2021; Anichini and Gattiglia 2022). Another example of a mobile application for ceramics classification is presented by Santos et al. (2024), and a broader project of automation and digitalisation tools for archaeological artefacts has been developed within the Automata initiative (Naso and Sciuto 2025). The reconstruction of ceramics based on only a few pottery sherds has also been achieved with machine learning methods (Stamatopoulos and Anagnostopoulos 2016; Cardarelli 2024). Another promising development comes from Ruschioni et al. (2023), who classified ceramics not by their typology but by their chemical properties, which are more readily compatible with a machine learning workflow.

In summary, since classifying artefacts is one of the core elements of archaeological work, the use of machine learning for this task has been debated theoretically and tested since the early 2000s. While ceramics are the most represented type of artefact used for machine learning-based classification in our corpus, a wide variety of materials have also been used (e.g., ivory, stone tools, etc.). With the majority (62%, n = 18) of study cases using image data as inputs, the recent increase in performance of neural network models could spur further advances for artefact classification. The future of this discipline will also likely be linked to more data-driven material, standardised pre-processing, and user-friendly classification software, whether portable or not.

4.2.2. Archaeological predictive models (APMs)

Already present since the development of settlement pattern analysis (Willey 1953), the tradition of APMs rose alongside the development of modelling and prediction theory in archaeology (Judge and Sebastian 1988). It is therefore surprising to see so little interest in applying machine learning methods to this approach in recent years, even up to 2022 (Figure 5A). APMs focus on predicting unknown or undiscovered sites, so that they can be discovered and documented, based on previous field observations and predictor variables such as climate, topography, and anthropic features (Brandt et al. 1992; Kvamme 2006). Critics have pointed to the relative reluctance of researchers towards APMs (Kvamme 2006, pp. 7, table 1.1; Lock and Harris 2006, pp. 42–45). One main criticism of APMs is that “… predictive modelling as it is presently practiced is fundamentally about environmental determinism” (Kohler 1988, p. 19). This statement is particularly true when we consider the strong influence of environmental and ecological studies in the field of landscape archaeology (cf. 3.3.4). However, even though this criticism is neither new nor confined to the past (Wheatley 2004; Arponen et al. 2019; Kristiansen 2019), responses to it exist. Coombes and Barber emphasise the value of models despite their simplifications, stating that “A simple model cannot hope to replicate all the complexities of environment–culture relationships across a civilization, but one basic approach that can provide valuable insights is to treat human populations in ecological terms, with their ranges shifting in response to changing conditions” (Coombes and Barber 2005, p. 305). Furthermore, new applications of APMs dedicated to protecting cultural heritage against looting (El-Hajj 2021) could prove a promising way to further aid in the protection of cultural heritage alongside more intensive public funding.

The frequent use of the MaxEnt model for APMs, initially designed by ecologists (Phillips et al. 2006; Sillero et al. 2021), shows the influence of environmental niche concepts in archaeology (Demján et al. 2022a; Vernon et al. 2022; Lundström et al. 2024; Yaworsky et al. 2024a, b). MaxEnt is well suited to examining questions of dispersion and provides more robust and deterministic archaeological predictions because it is a presence-only model, incorporating background points rather than the true-absence or pseudo-absence data used by other methods, including machine learning: “As a result, MaxEnt is more suitable for archaeological data than the other predictive modelling approaches” (Yaworsky et al. 2020, p. 14). Furthermore, when compared to other models such as random forest, which can be prone to overfitting, MaxEnt appears to be the best option for species dispersion problems (Valavi et al. 2022).
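
MaxEnt itself is normally run through dedicated implementations (the original Java software or ecological R packages). As a loose illustration only of the presence/background logic described above, the sketch below contrasts known presences with background points drawn from the study area using a penalised logistic regression, a commonly cited approximation of MaxEnt rather than MaxEnt itself; covariates and sample sizes are synthetic assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
presence = rng.normal(loc=1.0, size=(80, 5))      # environmental covariates at known sites
background = rng.normal(loc=0.0, size=(1000, 5))  # covariates at random background locations

X = np.vstack([presence, background])
y = np.concatenate([np.ones(len(presence)), np.zeros(len(background))])

# The fitted "suitability" is relative: it ranks locations rather than giving a true
# probability of site presence, one reason presence-only outputs must be read with care.
model = LogisticRegression(penalty="l2", C=1.0, max_iter=1000).fit(X, y)
suitability = model.predict_proba(background)[:, 1]
print(suitability[:10].round(3))
```
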

An essential consideration in APMs is the number of covariates, predictors, or variables required for building the models. Yaworsky et al. (2020) used 55 predictors, Hansen and Nebel (2020) employed 26, Friggens et al. (2021) used 25, Castiello and Tonini (2021) used 13, and Benner et al. (2019) relied on 8. However, if data transformation techniques such as PCA are applied, which seek to reduce the number of covariates whilst maximising the impact of those that remain, a smaller number of predictors can be used for the final model instead (Hansen and Nebel 2020; Yaworsky et al. 2020; El-Hajj 2021). Research on species distribution models assesses the maximum number of predictors as k = (n − 50)/8, i.e. n ≥ 50 + 8k (Field et al. 2012), k being the number of predictors and n the number of occurrences of the species.
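
The predictor-reduction step mentioned above can be sketched briefly as follows (a hedged illustration on synthetic, correlated covariates, not the workflow of any cited study): scale the raw predictors, then keep only the principal components needed to explain a chosen share of the variance before passing them to an APM.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
latent = rng.normal(size=(500, 5))                 # a few underlying environmental gradients
covariates = latent @ rng.normal(size=(5, 26)) + rng.normal(scale=0.1, size=(500, 26))

scaled = StandardScaler().fit_transform(covariates)   # PCA is sensitive to predictor scale
pca = PCA(n_components=0.95).fit(scaled)               # keep components covering 95% of variance
reduced = pca.transform(scaled)
print(f"{covariates.shape[1]} raw predictors reduced to {reduced.shape[1]} components")
```
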

The large number of variables required, the complex preprocessing steps needed, and the difficulty of selecting and interpreting the influence of the model variables make APMs a time-consuming and less attractive method than approaches based on image recognition, which can be simpler to apply. The theoretical criticisms discussed above compound this methodological difficulty, which may discourage adding further complexity in the form of machine learning and could explain the low number of such applications compared to other tasks.

4.2.3. Automatic structure detection

Over the past decade, there has been an explosion in machine learning methods applied to automatic structure detection tasks (Figure 5A). Though earlier applications were mostly based on satellite SRTM or ASTER images (Menze et al. 2006; Menze and Ur 2012), the recent explosion of light detection and ranging (LiDAR) and airborne laser scanning (ALS) data (Wurzer et al. 2015; Gillings et al. 2020) has led to an increase in interest in this field. While earlier studies employed random forest methods (Menze et al. 2006; Guyot et al. 2018; Stott et al. 2019), recent papers mainly applied ANNs, which aligns with the predominance of image data and of neural network architectures that excel at image classification tasks. The mask region-based convolutional neural network (MR-CNN) model, adapted from convolutional neural networks (CNNs), has been eagerly applied by archaeologists, as its raison d’être is well aligned with the task of detecting archaeological structures from image data (Bundzel et al. 2020; Bonhage et al. 2021; Carter et al. 2021; Davis et al. 2021; Guyot et al. 2021; Altaweel et al. 2022; Banasiak et al. 2022; Fisher et al. 2022). MR-CNNs segment images via masked regions, aiding structure identification and validation. Furthermore, the emergence of transfer learning has amplified the use of CNNs in archaeology (Gallwey et al. 2019; Soroush et al. 2020; Herrault et al. 2021). Transfer learning takes a neural network model trained on a different but similar type of image data (e.g. detecting bicycles in images) and retrains it on the actual data of interest (e.g. detecting motorcycles) using a much smaller dataset than would be needed without transfer learning; it can therefore be helpful where only small datasets are available, reducing the time and cost of data acquisition and preparation. In our dataset of articles, the use of transfer learning – counted in 41% (n = 16) of all study cases reviewed on automatic structure detection – was generally based on models pre-trained with image datasets such as ImageNet (Russakovsky et al. 2015) or COCO (Lin et al. 2014), though in one case the authors relied on a specialised dataset (Silburt et al. 2019).
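
A hedged sketch of the transfer-learning pattern just described: a CNN pre-trained on ImageNet is reused as a frozen feature extractor and only its final layer is retrained on a small, task-specific image set (e.g. LiDAR-derived visualisations of mounds versus background). It assumes torch and torchvision (≥ 0.13) are installed; the two-class setup and dataset are hypothetical, not those of the cited studies.

```python
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# Freeze the pre-trained backbone so only the new classification head is updated.
for param in model.parameters():
    param.requires_grad = False

# Replace the 1000-class ImageNet head with a 2-class head (structure / no structure).
model.fc = nn.Linear(model.fc.in_features, 2)

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
# Training then loops over a (small) labelled DataLoader of image tensors as usual.
```
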

However, as Herrault et al. (2021) have highlighted, CNNs have limitations in their interpretability and are also highly sensitive to unbalanced classes (i.e. the categories available for prediction), a common issue for archaeological data. A deeper exploration of models, as done by Monna et al. (2020), who explored six distinct families of methods (ANNs, Bayesian classifiers, linear classifiers, ensemble learning, unsupervised learning and nearest neighbour classifiers), allows for a better understanding of the data, despite the increase in complexity and time. Following the training of the models, Monna et al. (2020) aggregated predictions through a hard voting mechanism, itself a form of ensemble learning, ultimately identifying random forest (an ensemble learning method) as the best model.
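
To illustrate the hard-voting aggregation strategy mentioned above, the sketch below combines several model families and lets the majority vote decide; the data and the exact set of base models are illustrative assumptions, not those of Monna et al. (2020).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=400, n_features=10, random_state=4)

vote = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(random_state=4)),   # ensemble learning
        ("svm", SVC(random_state=4)),                     # kernel classifier
        ("knn", KNeighborsClassifier()),                  # nearest neighbour
        ("nb", GaussianNB()),                             # Bayesian classifier
    ],
    voting="hard",                                        # majority vote across families
)
print(cross_val_score(vote, X, y, cv=5).mean())
```
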

To evaluate the success of machine learning models for archaeological structure detection, authors have access to a wide range of metrics. The F1-score is the most widely used, with 20 recorded uses (e.g. Character et al. 2021; Altaweel et al. 2022; Banasiak et al. 2022), but prediction accuracy is also common, with 16 applications in the study cases on automatic structure detection (e.g. Monna et al. 2020; Davis and Lundin 2021). While not always directly discussed, precision, average precision, and recall can nevertheless be found in the results of most of the study cases reviewed here (Bonhage et al. 2021; Berganzo-Besga et al. 2021). Intersection over union (IoU) is only present in five study cases (Bundzel et al. 2020; Bordon et al. 2021; Banasiak et al. 2022; Trotter et al. 2022; Yang et al. 2022), despite revealing valuable information on a model’s accuracy and being particularly well suited to image segmentation tasks. The newly developed “boundary IoU” (Cheng et al. 2021) was specifically created for evaluating image segmentation models and could become an important addition to future studies. New metrics have even been developed explicitly for archaeological image segmentation: Fiorucci et al. (2022) introduced centroid-based and pixel-based measures, as they found IoU not ideal for evaluating discrete archaeological objects.
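
To make the segmentation metric discussed above concrete, here is a minimal numpy sketch of intersection over union for two binary masks (predicted versus reference structure footprints). Boundary IoU (Cheng et al. 2021) refines this idea by scoring only pixels near object boundaries and is not reproduced here; the masks below are synthetic.

```python
import numpy as np

def iou(pred: np.ndarray, truth: np.ndarray) -> float:
    """IoU = |pred AND truth| / |pred OR truth| for boolean masks."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    union = np.logical_or(pred, truth).sum()
    if union == 0:
        return 1.0  # both masks empty: define as perfect agreement
    return np.logical_and(pred, truth).sum() / union

pred = np.zeros((100, 100), dtype=bool); pred[20:60, 20:60] = True
truth = np.zeros((100, 100), dtype=bool); truth[30:70, 30:70] = True
print(round(iou(pred, truth), 3))   # partial overlap gives an IoU between 0 and 1
```
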

The significant popularity of machine learning applications for automatic structure detection is further explained by their perceived suitability for answering the research question of whether a certain feature is an archaeological site or structure. Machine learning methods seem to be particularly effective when applied to massive structures such as burial mounds (Guyot et al. 2018; Caspari and Crespo 2019; Monna et al. 2020; Berganzo-Besga et al. 2021) or large archaeological sites (Menze et al. 2006; Menze and Ur 2012; Stott et al. 2019).

To summarise, following the explosion of big data such as archaeological remote sensing imagery (especially LiDAR), combined with the development of powerful deep learning models, new possibilities for the identification of settlements and archaeological structures have appeared. Leaning heavily on transfer learning and pre-trained models, the aforementioned studies applying these newly developed models provide encouraging results for archaeologists. However, the diversity of applications, metrics, and pre-processing, and the very nature of the neural network models, all lead to a heterogeneity of methodologies available for this task.

4.2.4. Digital heritage

In the articles applying machine learning methods to digital heritage, we identified two different approaches. On the one hand, there were study cases with a complete workflow, in which a first step of machine learning-based semantic image segmentation preprocessing is followed by a second, machine learning-based classification step (Grilli and Remondino 2019; Nogales et al. 2021). These successive steps follow a gradual and systematic process, which meets the requirements of the Knowledge Discovery Process (cf. 4.2.1, Fayyad et al. 1996). On the other hand, there were works that focused either only on the semantic segmentation step (Felicetti et al. 2021; Matrone and Martini 2021) or only on the classification step (Toler-Franklin et al. 2010; Mesanza-Moraza et al. 2021; Prasomphan 2022; Pavan Kumar et al. 2022; Pepe et al. 2022).

The high number of ANN models for semantic image segmentation adapted to cultural heritage tasks confirms a broader tendency (Sultana et al. 2020), with ANNs generally obtaining better results than decision trees (Boston et al. 2022). Non-standardised datasets and the diversity of the preprocessing stages made for a heterogeneous assemblage of reviewed study cases, making them difficult to compare adequately. The lack of standard protocols (see part 7 of Fiorucci et al. 2020) has led to the continued use of older models (e.g. Pepe et al. 2022). Limiting the use of colour images or reducing the number of classes has been suggested to improve future models (Grilli and Remondino 2019).

Two different approaches are thus currently used in machine learning applications to digital heritage problems: one emphasises a more progressive and integrative workflow with successive, detailed pre-processing steps, while the other focuses on semantic segmentation alone. Although deep learning models, in particular ANNs, appear to be the most used, results are still hardly trustworthy in most cases, due to the lack of comparative datasets and standardised metrics.

4.2.5. Text analysis

Text analysis is one of the earliest topics to benefit from machine learning applications (Boon et al. 2009), as the potential benefits were quickly identified (Richards et al. 2015). The quantity of information and data collected in the archaeological field rose exponentially following the development of new technologies, systematic recording processes, and the development of rescue archaeology (Brandsen 2023, p. 229). This veritable deluge of records (Bevan 2015) has made analysing the tremendous number of archaeological texts difficult. It is therefore unsurprising that machine learning methods already developed for text extraction in other fields found adoption in archaeology.

Two approaches have emerged from this new demand for parsing large numbers of archaeological reports, along with a third, unrelated one. Firstly, machine learning methods can quickly find similarity between two texts where a human may only find heterogeneity (Boon et al. 2009). Secondly, machine learning can also be useful for highlighting the underrepresentation of certain archaeological findings in the existing literature (e.g. cremation: Brandsen and Lippok 2021, p. 6). Lastly, researchers have also used machine learning to extract and interpret graphemes or other signs from archaeological material. This last application focuses on extracting glyphs from images (Dhivya and Devi 2021) or finding matching pieces from document fragments (Abitbol et al. 2021). For these image-based tasks, ANNs are the family of models used (Dhivya and Devi 2021; Abitbol et al. 2021), likely due to their effectiveness in computer vision tasks (cf. 4.2.2). However, we consider these results still too limited at present.
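
As a hedged sketch of the first approach, the snippet below measures similarity between short excavation-report passages with TF-IDF vectors and cosine similarity. This simple bag-of-words baseline stands in for the more elaborate ANN-based pipelines used in the reviewed studies; the example sentences are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

reports = [
    "Trench 2 revealed a pit with burnt bone and late medieval pottery sherds.",
    "A refuse pit containing charred animal bone and medieval ceramics was excavated.",
    "Geophysical survey identified a rectilinear enclosure north of the site.",
]

vectors = TfidfVectorizer(stop_words="english").fit_transform(reports)
similarity = cosine_similarity(vectors)
print(similarity.round(2))   # the first two reports score far higher than the third
```
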

The study cases that used natural language processing, required for analysing content similarity in text documents, all relied on ANNs, except for the early application of Boon et al. (2009), which instead used an unsupervised memory-based learning (MBL) model. Large language models, having been introduced only recently (Vaswani et al. 2017), did not feature in our dataset due to our cutoff point of 2022, though there have already been some applications more recently (Chang et al. 2024). Despite their youth, the impact of large language models across modern society has been substantial (Tamkin et al. 2021; Eloundou et al. 2023; Clusmann et al. 2023; Tayan et al. 2024), and the archaeological field will undoubtedly implement these new models for either research purposes or education (Agapiou and Lysandrou 2023; Cobb 2023).

Despite the essential role of text analysis in archaeological research, machine learning applications for this task are poorly represented in our corpus. Relying almost exclusively on ANN models, these applications face significant methodological and technical issues during the pre-processing of textual data. However, the recent development of large language models, absent from our corpus, is likely to drastically change the automated analysis of archaeological textual data in the future.

4.2.6. Taphonomic classification

Despite the wide range of methods applied, many of the articles reviewed in this category were found to contain important methodological issues, ranging from unclear and insufficient reporting of the methods applied to important considerations of the data used that were seemingly eschewed.

Some of these concerns have already been highlighted (McPherron et al. 2022; but see the reply by Abellán et al. 2022, including its Supplementary Texts 1 and 2; Moclán and Domínguez-Rodrigo 2023; see also Courtenay et al. 2024), specifically the unclear methodology for the process of “bootstrapping” described in one publication in our reviewed dataset (Domínguez-Rodrigo 2018; but also present in our dataset in Domínguez-Rodrigo and Baquedano 2018; Aramendi et al. 2019; Moclán et al. 2019, 2020; and Courtenay et al. 2019). The unclear methodology reported in Domínguez-Rodrigo (2018) could be interpreted as stating that bootstrapping was applied to create a larger overall dataset which was then split for training and testing (as seems to also be explicitly stated in Domínguez-Rodrigo and Baquedano 2018, but less so in Aramendi et al. 2019 or Courtenay et al. 2019). This constitutes a methodological failure, as the models will share the same data across both the training and testing datasets, rendering any results unusable, much as an exam result would be if the examinee had the answer sheet. Although we do not wish to reiterate the “bootstrapping” discussion, it is still important to highlight how imprecise reporting of methodology and code unavailability can cast doubt on the validity of the methods applied.

When the data and code are made available, as in Abellán et al. (2022), concerns are easier to either confirm or dismiss. For instance, the Ensemble.ipynb file provided in the Supplementary Data (henceforth SD) 2 contains an off-by-one error, which has the effect of excluding the last variable when training the models used. If this is indeed the code used for the main article (it is at least for Supplementary Text 4), it would have considerable implications for the reported results and the accuracy of the reported methods. Moreover, the R file croc paper 2022 code.r contains additional problems in the “correct” bootstrapping section (module 4 in the file), and we could not run the code for this module without errors (with R 4.3.2 and caret 7.0–1).

On the other hand, the code suggests that the definition of “bootstrapping” used by the authors is to resample the input data with replacement to obtain a larger synthetic input dataset and then perform k-fold cross-validation when training the model (perhaps explaining the reported use of both bootstrapping and k-folds in Domínguez-Rodrigo 2018 and Domínguez-Rodrigo and Baquedano 2018). This is contrary to the ideal implementation of bootstrapping, where the resampling is done by each classifier in the ensemble learning model, thus creating multiple bootstrapped datasets (one per classifier), as well as providing an out-of-bag error to verify the accuracy of each classifier when aggregating the models (Breiman 1996). The code also implies that their “proper bootstrapping” process for creating a larger dataset may be applied to the testing data after splitting as well, but this would severely skew model accuracy metrics, since the same data could be correctly or incorrectly predicted multiple times.
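
A small numerical illustration (our own, not code from the discussed publications) of why resampling with replacement before the train/test split leaks data: duplicated rows of the inflated dataset end up on both sides of the split, so the test set is no longer unseen.

```python
import numpy as np
from sklearn.utils import resample
from sklearn.model_selection import train_test_split

specimens = np.arange(100)                      # identities of 100 original specimens

# Inflate the dataset to 500 rows by sampling the 100 originals with replacement,
# then split the inflated dataset, as the reviewed description seems to imply.
inflated = resample(specimens, n_samples=500, replace=True, random_state=5)
train, test = train_test_split(inflated, test_size=0.25, random_state=5)

shared = np.intersect1d(train, test)
print(f"{len(shared)} of the 100 original specimens appear in both train and test")
```
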

Other issues may require even more technical knowledge to identify, such as those in Moclán et al. (2019), who evaluated their models with both a bootstrapped and a non-bootstrapped experimental sample, seeking to predict the agent (anthropic direct percussion, spotted hyena (C. crocuta) carnivory, and wolf (C. lupus) carnivory) of bone fracture planes in bone fragments. However, the dataset provided in the Electronic supplementary material 1 (ESM 1) also contains information on the bone fragment the fracture planes come from (e.g. epiphysis presence, length, and number of planes), rather than only the information on the plane itself (Moclán et al. 2019, app. ESM 1). In some cases a single bone fragment could have six fracture planes in the dataset, so when splitting the dataset, some training data may have come from the same bone as some testing data and thus share this bone-level metadata, likely leading to overfitting, as all fracture planes from the same bone fragment were caused by the same agent (see Varoquaux and Cheplygina 2022). Though no code was made available, the authors do state that these metadata variables were indeed used (Moclán et al. 2019, p. 4668). In addition, the number of entries in ESM 1 does not match the number of fragments reported for any of the three classes. Even when counting only unique bone fragments instead of planes using ESM 1, the numbers still do not match for any class. Furthermore, for the hyena class, it is also clear that some fragments did not have all their fracture planes added to the ESM 1 dataset, since there is an odd number of entries stating that the number of planes per bone fragment is 2, implying that at least one plane was not included (Moclán et al. 2019).
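
One way to avoid this kind of leakage is to keep all fracture planes from the same bone fragment on the same side of the split. The hedged sketch below uses a group-aware splitter for that purpose; variable names, class labels, and data are illustrative assumptions, not the original study's code.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(6)
X = rng.normal(size=(600, 9))                 # measurements on individual fracture planes
y = rng.integers(0, 3, size=600)              # agent: human, hyena, wolf (synthetic)
bone_id = rng.integers(0, 150, size=600)      # which bone fragment each plane belongs to

splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=6)
train_idx, test_idx = next(splitter.split(X, y, groups=bone_id))

# No bone fragment is shared between the two sets, so fragment-level metadata cannot leak.
assert set(bone_id[train_idx]).isdisjoint(bone_id[test_idx])

clf = RandomForestClassifier(random_state=6).fit(X[train_idx], y[train_idx])
print(clf.score(X[test_idx], y[test_idx]))
```
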

The issues in Moclán et al. (2019) are likely also present in another article (Moclán et al. 2020; and perhaps in Abellán et al. 2022, at least for the tramp_cm_croc_MDR_EB.txt dataset, which contains a ‘number of grooves’ variable). Although the data and code used by Moclán et al. (2020) were not made public, the authors state that the work was a continuation of Moclán et al. (2019), repeatedly citing it as their methodological starting point, and likely using its trained model, especially as the variables described were identical. Moclán et al. (2020) focused on classifying bone fracture agency in archaeological – as opposed to experimental – material, yet they provide strong conclusions based on the results of a model with considerable unaddressed (and unacknowledged) problems. A similar issue could be raised for Abellán et al. (2022), though they are far more guarded in their language. Nevertheless, Fig. 4 in Abellán et al. (2022) states that for some of the archaeological BSMs analysed, the models gave a classification probability of 99% for being crocodile-made, but at the same time also a 99% probability for being a cut mark. Oddly, the probabilities do not add up to 1 except for the last BSM shown (BOU-VP-11/12), the only one classified as a cut mark. Whether this was a transcription error or a deeper issue is unclear, as detailed results are not available in the SD. The vast class imbalance of the input data is another issue, despite the authors suggesting otherwise. Even classifying no crocodile BSMs correctly would provide an F1 score of 0.92 if only 11 out of 146 testing cut marks were misclassified, making the lack of a confusion matrix or detailed results reporting concerning, even if no strong conclusion was put forth (see Courtenay et al. 2024).

The use of machine learning methods to draw any conclusion from their own predictions on archaeological data must be done very carefully: the limitations of the methods must be clearly and transparently examined, and the strength of the conclusions must be scrupulously tempered. Anything else should not be considered good scientific practice.

However, despite the lengthy discussion above, not all study cases reviewed had methodological shortcomings. The methods used in other articles in this category, such as Byeon et al. (2019) and Cifuentes-Alcobendas and Domínguez-Rodrigo (2019), are promising and well grounded (with only a few caveats, such as the latter not stating whether BSMs from the same bone were split across both training and testing, which could have unduly increased validation accuracy; see also Courtenay et al. 2024). They should definitely form the basis of further studies on the subject, including those seeking to independently verify and replicate the results of the numerous BSM agent classification studies, especially as recent attempts have not obtained the same optimistic results despite employing robust methodology (Courtenay et al. 2024).

4.3. Recommendations and good practices

From our comprehensive review, several lessons for future improvement have emerged, which we have categorised into either methodological or theoretical considerations.

A first significant methodological concern is the lack of standardised workflows and practices. For example, for automatic structure detection tasks (Bellat and Scholten 2024), it is impractical to compare studies with different inputs (e.g. RGB, DEM, multispectral images) or with different, unharmonised preprocessing. One step towards standardisation could be to promote transfer learning from models pre-trained on publicly available datasets, which has proven effective in some reviewed study cases (Gallwey et al. 2019; Herrault et al. 2021). It could be particularly efficient in artefact classification, cultural heritage reconstruction, or automatic structure detection, where numerous, closely comparable image collections already exist, e.g. BigEarthNet (Sumbul et al. 2019), AID (Xia et al. 2017), Million AID (Long et al. 2021), and NWPU-RESISC45 (Cheng et al. 2017). These collections have demonstrated their utility in enhancing model accuracy for remote sensing tasks (Wang et al. 2023; Thapa et al. 2023). For text extraction or grey literature comparison tasks, adopting a common workflow for data cleansing and preparation is crucial. In addition, reported metrics should be standardised to allow for better comparisons across studies; in the case of prediction problems, precision and recall, as well as confusion matrices, should always be made available (see the sketch below). The field of isoscape analyses, where most publications follow a similar workflow and script based on Bataille et al. (2018, 2020), provides a good example (Serna et al. 2020; Janzen et al. 2020; Barberena et al. 2021; Holt et al. 2021; Bataille et al. 2021), fostering a community of practitioners rather than isolated individual practices. This approach could be a goal for different archaeological subfields aiming to develop machine learning applications and create a cohesive community. In their study, Batist and Roe (2024, table 5, figure 6) highlighted a similar phenomenon in open archaeological datasets for specific communities (radiocarbon, databases) that have very strong ties and an active, solid sharing policy.
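
Following the recommendation above that precision, recall, and confusion matrices be reported systematically, a minimal sketch of producing all three for any fitted classifier; synthetic labels and predictions stand in here for a real model's output.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

y_true = np.array([0, 0, 0, 1, 1, 1, 1, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 1, 2, 2, 2, 0])

print(confusion_matrix(y_true, y_pred))        # rows: true class, columns: predicted class
precision, recall, f1, support = precision_recall_fscore_support(y_true, y_pred)
print(precision.round(2), recall.round(2), f1.round(2), support)
```
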

Another major concern is the difficulty of clearly delimiting the extent of machine learning. The definition used here was particularly strict, excluding algorithms that are more traditional statistical methods or that mainly deal with dimensionality reduction, even though many of the articles reviewed here reported them as machine learning, such as k-means and PAM/k-median (Mircea et al. 2015b; Altaweel and Squitieri 2019; Febriawan et al. 2020; Cacciari and Pocobelli 2021; Demján et al. 2022b; Bouzid and Barge 2022; Fernee and Trimmis 2022; Badawy et al. 2022), as well as PCA, LDA, and QDA (Monna et al. 2020; Ma et al. 2021; Abellán et al. 2022; Anglisano et al. 2022; Badawy et al. 2022). It is not easy to draw a line between what is machine learning and what is not, since it extends across the fields of both statistics and computer science, if not more (Bzdok 2017; Bzdok et al. 2018). Therefore, we do not wish to suggest our definition is conclusive, but we must also highlight the varied understanding of the concept across authors, which can make an analysis of machine learning applications in a specific field challenging.

Data availability and open access to data are often discussed as major challenges in archaeological research. In general, the entire workflow, from original data collection to publication, should be accessible. Following the FAIR principles (Wilkinson et al. 2016), data can be stored in open-access data articles (e.g. Journal of Open Archaeology Data) or on institutional and international platforms (e.g. Zenodo, OSF, FigShare). The code should also be accessible, either on an online platform (GitHub, GitLab) or as supplementary material. Moreover, all the results, including those not discussed, should be available to the reader. Adopting a FAIR workflow and open-access data gives the opportunity to share methodologies and outcomes with a wider public, and potentially aids in the preservation of cultural heritage (Fisher et al. 2021). The Peer Community Journal Archaeology (PCI Archaeology) provides a good example of a fully accessible and FAIR process, requesting the open publication of all data, scripts, and code used for the published study, as well as a clear and thorough reporting of all the methods used, enough for them to be independently reproducible. A list of software and platforms suited to increasing reproducibility in archaeological research, although slightly outdated, is given in the supplementary tables of Strupler and Wilkinson (2017).

A major theoretical issue in many studies is the absence of clear archaeological questions that require, or at least are well suited to, machine learning methodologies. Authors develop proofs of concept without clearly stating the problems that necessitate the application of machine learning (Albertini et al. 2017; Stott et al. 2019; Gallwey et al. 2019; Ramazzotti 2020; Ushizima et al. 2020; Vos et al. 2021; Dhivya and Devi 2021; Lyons et al. 2022). This issue was already noted in remote sensing applications in African archaeology: “Much of the recent literature employing new analytical methods for remote sensing is purely experimental and thus is interested solely in developing methods that can be more widely applied by future work” (Davis and Douglass 2020). Clear, well-defined research questions are essential before applying machine learning methods, as simpler statistical solutions may suffice in many cases. The “theory in, theory out” concept developed by Radford and Joseph (2020, fig. 1) meets all these requirements, with a theoretical statement prior to the model design and a reflection on the model’s efficiency and its limitations for future theoretical implementation after the model has been run. This gap between research questions (and theory) and data interpretation matters not only for digital applications but more broadly for all analyses performed on archaeological data, as has been underlined by Perreault (2019).

Furthermore, the validity and (especially inter-rater) reliability of the datasets used are in many cases not questioned enough (Tennie 2023), although they are essential for reproducibility. Another issue is noise: the outcomes may be valid and reliable when analysed together, but may have low precision. All these problems will become more urgent once a larger transition to machine learning-based methods has taken place and no traditional benchmarks exist against which to compare model results in a broader archaeological context. Besides the fact that the traditional benchmarks are often poor in validity and reliability themselves, the opacity (see black box in section 4.1) of many machine learning model families (especially artificial neural networks, which have recently surged in popularity) renders the whole problem even more difficult to solve. One key solution could reside in the development of explainable artificial intelligence (XAI), which intends to bring more transparency throughout the entire automation process (Barredo Arrieta et al. 2020).

Interdisciplinarity and collaboration pose another challenge. Archaeologists should have the primary say in formulating research questions, even when working with computer scientists (Orton 1980, p. 15). For example, the archaeology of early modern buildings integrates archaeologists and architects, yet the research questions ought to remain archaeological in nature for an archaeological study. Methodological questions can be developed by computer scientists, but the primary goals must stay aligned with archaeological objectives (Figure 11).

Another important theoretical issue is the reflection throughout the automation process, which also corresponds to the concept of “theory in, theory out”. As mentioned earlier, the use of the Knowledge Discovery Process (Hörr 2011; Hörr et al. 2014) represents an innovative and holistic approach to fully understanding an archaeological artefact. It involves an initial clustering or unsupervised approach, followed by a supervised approach, with an optional semi-supervised step in between. This process encourages deeper reflection, a less subjective approach, and more informed decision making. It is also a matter of taking the time to label the various data, as Klassen et al. state: “Given the nature of archaeological data, it is often difficult or expensive to get ‘labels’ for things like artifact typologies and site chronologies” (Klassen et al. 2018). Therefore, standardised ontologies and semantic consistency are needed to ensure the interoperability of the methods and results (Huggett 2017; Davis 2020a).

Finally, while researchers can have doubts about the validity of results obtained through some applications of machine learning, as well as the validity of its underlying reasoning and of the way these methods were applied (see Varoquaux and Cheplygina 2022), and all these aspects need to be critically examined, these methods also hold great potential. They can reveal unseen relationships within a dataset, moving beyond human interpretation to a more pragmatic system. As Ramazzotti (2020, p. 174) states, machine learning in archaeology can recreate “a possible world of other associations of meaning devoid of sources and dispersed information, it exhibits the nuances and complex interrelations and, furthermore, it helps the interpreter codify other associations that were unforeseen (or hidden)”. Some might even consider highly mechanised and automated processes (machine learning being only one of many) as a new step in the human technical gesture (Leroi-Gourhan 2022, p. 74).

4.4. Trends and perspectives

Machine learning evolves rapidly, and the research presented in this article was based on articles published before 2023. New algorithms and applications that are not covered in our review have already emerged in the field of archaeology (e.g. LLMs, cf. Agapiou and Lysandrou 2023; Cobb 2023; Lapp and Lapp 2024). Any exhaustive literature review takes considerable time to accomplish, and it is difficult to remain up to date with a set of technologies evolving as quickly as machine learning. Nonetheless, our review highlights several potential trends for the future of machine learning in archaeology.

Firstly, regarding the general development of machine learning in archaeology, the academic background of new students will evolve. We identified four programmes focused on computer science applications in archaeology: the “Computational Archaeology: GIS, Data Science and Complexity” master at UCL in London, the “Digital Archaeology” master at York University, the “Digital and Computational Archaeology” programme at Universität zu Köln in Cologne, and the “Digital Heritage and Landscape Archaeology” master at the University of Cyprus in Nicosia. This number will undoubtedly grow, and machine learning will become as familiar to the next generation of archaeologists as GIS and remote sensing (Agapiou and Lysandrou 2015) or open-source principles (Batist and Roe 2024) are today. Although not explored in this review, explainable artificial intelligence is likely to be trialled for many applications, as it also enables the archaeologist to see the hidden relationships between their data, processing workflow, and results. Archaeological development of such tools can allow more interpretability either during the process (Labba et al. 2023) or post hoc (Tenzer et al. 2024).

Secondly, for image recognition tasks, particularly those related to cultural heritage and architecture, image segmentation will become an essential preprocessing step (Grilli et al. 2018). While already widely used, integrating segmentation at the beginning of a pipeline and linking it to a later classification algorithm remains rare (Grilli and Remondino 2019; Nogales et al. 2021). This remark also holds for automatic structure detection tasks, but the widespread use of MR-CNNs suggests that many practitioners are already aware of the importance and usefulness of semantic image segmentation.

Regarding automatic structure detection, economic opportunities will almost certainly emerge for such applications. Private and public companies seek to manage the costs and risks associated with new installations and construction, and automatic structure detection can help avoid unnecessary excavations. A private company in Ireland has already tested the ADAF model (Čož et al. 2024) to reduce archaeological intervention time during track construction by predicting unseen and unreferenced archaeological mounds. Archaeological predictive models might also be regarded as potential tools for construction companies or even public administrations to regulate their cultural heritage conservation policies. However, it is crucial to use these tools responsibly, acknowledging that machine learning models are not infallible, nor can they replace manual fieldwork. These tools, if applied commercially, ought to be seen as a way to minimise, or at the very least better predict, the impact on archaeological material during construction, but never as a way to avoid archaeological intervention. In fact, it might be possible to observe or even prevent the destruction of archaeological material (El-Hajj 2021), especially with widely verified and trusted methodologies. Both software developers and archaeologists must clearly define thresholds for feature classification or prediction and provide diverse datasets to ensure comprehensive predictions and avoid overlooking archaeological features.

The future development of new algorithms for the classification of artefacts is another key area. Unlike automatic structure detection, where MR-CNNs dominate, there are no standard models in artefact classification. Although ANNs have taken the lead, as they have in other disciplines (Zheng et al. 2021; Osco et al. 2021; Thai 2022), a few dominant models, whether within the ANN family of methods or not, will likely emerge. ResNet is already very popular and well suited to image recognition problems (Gualandi et al. 2021; Anichini et al. 2021; Kowlessar et al. 2021; Pawlowicz and Downum 2021; Berganzo-Besga et al. 2021; Jalandoni et al. 2022), despite its relative antiquity (He et al. 2016). Furthermore, the rising amount of data used will favour deep learning models over other types of models (e.g. ensemble learning, Bayesian classifiers), due to their higher number of parameters (Sarker 2021).

In the case of APMs, methods such as tuned random forest (an ensemble learning method) and MaxEnt, despite their age, remain well suited to the task (MaxEnt in particular, as it does not require data points for the absence of features; Yaworsky et al. 2020, 2024a, b). These methods will likely remain in use for some years due to the special situation of APMs. An alternative is to employ deep learning models coupled with interpretability tools such as LIME (Ribeiro et al. 2016) or SHAP (Lundberg and Lee 2017), which were not applied in our corpus and, to our knowledge, have yet to be applied in archaeology. One likely development will be studies combining traditional statistical approaches with supervised or unsupervised machine learning to obtain more readable and interpretable results (Li et al. 2024).
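
Since neither LIME nor SHAP appeared in our corpus, the following is only a hedged sketch of how such a tool could make an APM-style model more interpretable: SHAP values are computed for a random forest trained on synthetic “environmental predictors”. It assumes the third-party shap package is installed; the data, the regression framing, and the feature meanings are hypothetical.

```python
import numpy as np
import shap
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(7)
X = rng.normal(size=(300, 6))                   # e.g. slope, elevation, distance to water...
y = X[:, 0] + 0.5 * X[:, 2] + rng.normal(scale=0.5, size=300)   # synthetic "suitability"

model = RandomForestRegressor(random_state=7).fit(X, y)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X[:50])     # per-feature contribution to each prediction
print(np.abs(shap_values).mean(axis=0))         # mean absolute contribution of each predictor
```
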

Finally, we have not explored large language models (LLMs) in detail. The release of GPT-3.5 in 2022 (Brown et al. 2020) opened many new possibilities. Our review did not cover these developments, as we excluded any articles published after 2022. However, LLMs have already shown promising results in archaeology for teaching (Cobb 2023) and literature scraping, as well as classification (Agapiou and Lysandrou 2023; Bellat et al. 2024). LLMs have even been applied to the automatic classification of artefacts, with partially successful results (Lapp and Lapp 2024). In the future, LLMs could also be used for code generation or translation between programming languages; while Unnatural-Code-LLaMA-34B excels at generating new code, GPT-4 is the better option for code translation (Zheng et al. 2024). LLMs could make coding more accessible to a wider audience of archaeologists and are likely to increase the number of publications on machine learning applications in archaeology. They are not without their concerns, however, as many academic institutions (University of Tübingen 2023; Oxford University 2024; Massachusetts Institute of Technology 2024; Wang et al. 2024) and even European policymakers (European Commission et al. 2024) have already set up guidelines to control and regulate the use of such models.

5. Conclusions

Our extensive review of applications of machine learning in archaeological research, which includes both quantitative and qualitative observations, highlighted several key phenomena. Expanding on Bickler’s assertion that “Machine learning arrives in archaeology” (Bickler 2021), our findings suggest that machine learning has been widely present in archaeology since at least 2019. The adoption of machine learning techniques has been slower than in other fields, likely due to the inherent inertia of the discipline. To understand the diversity of machine learning approaches used in archaeology, we focused primarily on four parameters: archaeological subfields, family of algorithms applied, model evaluation goal, and scientific task. The initial wave of machine learning applications focused on artefact classification and bioarchaeological problems (Nguifo et al. 1997; Toler-Franklin et al. 2010; Ionescu 2015; Mircea et al. 2015a). Automatic structure detection has developed strongly since 2019, likely due to the proliferation of image recognition algorithms based on convolutional neural networks (CNNs) and segmentation models such as mask region-based CNNs (MR-CNNs), especially since 2021. The heterogeneity of models used in archaeology contrasts with other fields such as medical imaging applications, where CNNs and generative adversarial networks (GANs) are predominantly used (Barragán-Montero et al. 2021).

Our work also highlighted some methodological concerns, likely arising from the limited background of many archaeologists in machine learning, as well as of some programmers in relation to archaeological problems. Davis and Douglass have already pointed out this gap in remote sensing applications in archaeology, noting a “disconnect between remote sensing applications and anthropological theory” (Davis and Douglass 2020). Another critical aspect is the reliability and variability of the benchmarks themselves, which are often poor, leading to additional problems in evaluating the models (Tennie 2023). Furthermore, the suitability of the chosen machine learning methodology to the scientific question at hand is crucial: in many cases, statistical methods are sufficient to provide an answer, possibly rendering machine learning methods superfluous. To address the challenges of applying these methods to archaeological questions, we have designed a flowchart of suggested best practices (Figure 11) to help archaeologists develop a coherent and effective approach. This small tool aims to bridge the gap between archaeological research and machine learning, promoting a more integrated and informed application of these technologies in the field.

Regarding the future of machine learning in archaeology, three key phenomena emerge. Firstly, archaeologists’ ongoing training in digital methods will lead to more comprehensive workflows and new applications in sub-fields of archaeology little represented in our current corpus (e.g., ethnoarchaeology, archaeobotany). Improving skills and knowledge will also reduce possible methodological issues and provide a more direct link between the original research questions and the available avenues to solve them. Secondly, the constant development of new models will leave space for new proof-of-concept studies with more complex and wider datasets, but should also allow for improved results in some domains (cultural heritage, text analysis, automatic structure detection, and artefact classification). Thirdly, the recent boom in large language models and generative AI will change research in archaeology at different levels, as it already does for other fields of academia (Andersen et al. 2025). Whether by enabling helpful tools for data curation, teaching, or multivariate analysis, these models are easy to use and have already been adopted by many in their daily life (though ethical issues remain; see Hagendorff 2024). One trend whose path remains uncertain is the possible formation of a reflexive community for AI and machine learning applications in archaeology (e.g. the COST action MAiA: MAIA 2025; or the CAA machine learning special interest group), leading to more standardised practices and protocols with ethical guidelines.

Despite the many concerns about applying machine learning in archaeology, these methods can be very powerful tools, able to massively reduce the time, cost, and (whether positive or not) labour needed to process, analyse, and predict even large-scale and complex data (e.g. Kochkov et al. 2024). The reported successes of machine learning in other areas of science, and its already increasing number of applications in archaeology in recent years, all show that it is cementing its place in the future of the field. We are likely at a turning point in archaeology – if not already in a post-digital era (Gattiglia 2025) – as in other disciplines, where a new approach – machine learning – has the possibility to permeate all levels of archaeological practice. Like any other statistical tool, machine learning itself is neither inherently good nor bad; its impact on the field of archaeology depends on how it is used (Huggett 2017). We should be mindful of the thinking derived from post-processual archaeology in the late 1970s (Hodder 1986), developed in response to processual archaeology (Phillips and Willey 1953). The answer of post-processual archaeology was to “explain the past rather than describing it” (Yoffee and Fowles 2010). Following this perspective, machine learning applications in archaeology should never be limited to data processing or analysis but should be integrated into a broader archaeological reflection with precise questions, albeit knowingly and carefully, making sure its limitations are well known, well reported, and well addressed.

Appendices

Annexes

Annexe 1

List of algorithms present in the study cases reviewed, organised by the approach and family of analysis methods they were categorised in, along with their abbreviations, the number of times they were used in our corpus, and the number of times they were selected by the authors of a study case as the best algorithm when comparing multiple models.

FAMILY | MODEL | ACRONYM | NB. OF USES | NB. OF TIMES BEST
Artificial Neural Network | Feedforward Neural Network | FNN | 23 | 4
Artificial Neural Network | Convolutional Neural Network | CNN | 14 | 1
Artificial Neural Network | Residual Neural Network | ResNet | 12 | 2
Artificial Neural Network | Mask Region-based Convolutional Neural Network | MR-CNN | 9 | 1
Artificial Neural Network | Faster Region-based Convolutional Neural Network | FR-CNN | 8 | 0
Artificial Neural Network | Visual Geometry Group | VGG | 8 | 2
Artificial Neural Network | U-Net | U-Net | 7 | 4
Artificial Neural Network | Inception Network | INC | 4 | 1
Artificial Neural Network | AlexNet | AlexNet | 3 | 0
Artificial Neural Network | RetinaNet | RN | 3 | 0
Artificial Neural Network | YOLO | YOLO | 3 | 0
Artificial Neural Network | DeepLabv3+ | DL3 | 2 | 0
Artificial Neural Network | Semantic Segmentation Model | SegNet | 2 | 0
Artificial Neural Network | Adaptive Deep Learning for Fine-grained Image Retrieval | ADLFIG | 1 | 0
Artificial Neural Network | Bidirectional Encoder Representations from Transformer | BERT | 1 | 0
Artificial Neural Network | Bidirectional Gated Recurrent Unit | BiGRU | 1 | 0
Artificial Neural Network | Bidirectional Long Short-Term Memory Network | BiLSTM | 1 | 0
Artificial Neural Network | Dynamic Graph Convolutional Neural Network | DGCNN | 1 | 0
Artificial Neural Network | DenseNet201 | DN201 | 1 | 0
Artificial Neural Network | Generative Adversarial Network | GAN | 1 | 0
Artificial Neural Network | Jason 2 | JAS2 | 1 | 0
Artificial Neural Network | Neural Support Vector Machine | NSVM | 1 | 0
Artificial Neural Network | Region-based Convolutional Neural Network | R-CNN | 1 | 0
Artificial Neural Network | Simple Network | SimpleNet | 1 | 0
Artificial Neural Network | Single Shot MultiBox Detector | SSD | 1 | 0
Bayesian Classifier | Naïve Bayes | NB | 11 | 0
Bayesian Classifier | Maximum Entropy | MaxEnt | 2 | 0
Decision Trees and Rule Induction | C5.0 | C5.0 | 7 | 2
Decision Trees and Rule Induction | C4.5 | C4.5 | 4 | 0
Decision Trees and Rule Induction | Decision Tree/Classification Tree | DT | 4 | 0
Decision Trees and Rule Induction | Conditional Inference Trees | CTREE | 2 | 0
Decision Trees and Rule Induction | Iterative Dichotomiser 3 | ID3 | 2 | 0
Decision Trees and Rule Induction | Classification And Regression Tree | CART | 1 | 0
Decision Trees and Rule Induction | Fast and Frugal Tree | FFT | 1 | 0
Decision Trees and Rule Induction | Learning with Galois Lattice | LEGAL | 1 | 0
Decision Trees and Rule Induction | Representative Trees | REPTree | 1 | 0
Decision Trees and Rule Induction | Random Trees | RT | 1 | 0
Ensemble Learning | Random Forest | RF | 54 | 20
Ensemble Learning | Adaptive Boost | AdaBoost | 2 | 0
Ensemble Learning | Stochastic Gradient Boosting | SGB | 2 | 1
Ensemble Learning | eXtreme Gradient Boosting | XGB | 2 | 1
Ensemble Learning | Bootstrap Aggregating | BAgg | 1 | 0
Ensemble Learning | Discrete Super Learner | dSL | 1 | 0
Ensemble Learning | Fast Random Forest | FRF | 1 | 0
Ensemble Learning | Gradient Boosting Regression Tree | GboostRT | 1 | 0
Ensemble Learning | LogitBoost | LB | 1 | 0
Ensemble Learning | Quantile Random Forest | QRF | 1 | 0
Ensemble Learning | Sequential Backward Selection-Random Forest Regression | SBS-RFR | 1 | 1
Ensemble Learning | Synthetic Minority Over-sampling Technique Boost | SMOTEBoost | 1 | 0
Ensemble Learning | Synthetic Minority Oversampling Technique + Edited Nearest Neighbor Rule | SMOTEENN | 1 | 0
Ensemble Learning | Super Learner | SP | 1 | 1
Ensemble Learning | Viola-Jones Cascade Classifier | VL-CC | 1 | 0
Genetic Algorithm | Genetic Algorithm | GA | 1 | 0
Linear Classifier | Support Vector Machine | SVM | 26 | 2
Linear Classifier | Structured Support Vector Machine | SSVM | 1 | 0
Nearest Neighbour Classifier | k-nearest neighbors | kNN | 19 | 1
Nearest Neighbour Classifier | Weighted k-nearest neighbors | kkNN | 3 | 0
Polynomial Classifier | Support Vector Machine with Radial Basis Function Kernel | SVMr | 7 | 1
Unsupervised Learning and Clustering | Affinity Propagation | AF | 1 | 0
Unsupervised Learning and Clustering | Hierarchical Cluster-Based Peak Alignment | CluPA | 1 | 0
Unsupervised Learning and Clustering | Databionic Swarm | DBS | 1 | 0
Unsupervised Learning and Clustering | Expectation-Maximisation Clustering | EMC | 1 | 0
Unsupervised Learning and Clustering | Graph-based Semi-Supervised Learning | GSSL | 1 | 1
Unsupervised Learning and Clustering | Iterative Closest Point | ICP | 1 | 0
Unsupervised Learning and Clustering | Iterative Self-Organizing Data Analysis | ISODATA | 1 | 0
Unsupervised Learning and Clustering | Nearest Centroid | NC | 1 | 0
Unsupervised Learning and Clustering | Simple Linear Iterative Clustering | SLIC | 1 | 0
Unsupervised Learning and Clustering | Self-Organizing Map | SOM | 1 | 0
Unsupervised Learning and Clustering | Tilburg Memory-Based Learning | TiMBL | 1 | 0
Unsupervised Learning and Clustering | Time series clustering | TSC | 1 | 0


Glossary

  • ALS = Airborne laser scanning

  • APM = Archaeological predictive models

  • GIS = Geographic information system

  • GPR = Ground penetrating radar

  • ICP = Iterative closest point

  • LDA = Linear discriminant analysis

  • LiDAR = Light detection and ranging

  • LogR = Logistic regression

  • MCC = Matthews correlation coefficient

  • MDA = Mixture discriminant analysis

  • ML = Machine learning

  • NLP = Natural language processing

  • PCA = Principal component analysis

  • QDA = Quadratic discriminant analysis

  • ROC = Receiver operating characteristic

  • UAV = Unmanned aerial vehicle

  • XRF = X-ray fluorescence

Data Accessibility Statement

The data used in this study are openly available in an OSF repository at https://doi.org/10.17605/OSF.IO/RUPGY. The code is available on GitHub in the “Google-Scholar-Scraper” repository (https://github.com/ac-jorellanaf/Google-Scholar-Scraper) for the manual screening and in “ML_archaeology_review” (https://github.com/mathias-bellat/ML_archaeology_review) for the automatic screening and figures.

Declaration of Generative AI and AI-assisted technologies in the writing process

During the preparation of this work the authors used ChatGPT 3.5 (Brown et al. 2020) in order to generate part of the R code used for the automatic screening protocol searches and figure plotting. After using this tool, the authors reviewed and edited the content as needed and took full responsibility for the content of the publication.

Acknowledgements

We would like to thank J. Padarian for providing help on scraping strategy and Á. Tejero-Cantero for the generous help and patience afforded during the earlier stage of the project.

Competing Interests

The authors have no competing interests to declare.

Author Contributions

M.B., J.D.O.F., C.T. and T.S. conceived the idea for the manuscript and initiated and designed the research. M.B. and J.D.O.F., collected data and selected the articles. M.B. and J.D.O.F., performed the statistical analysis and interpreted the results. M.B., J.D.O.F., and T.S. wrote the manuscript, with critical input from all co-authors. All authors reviewed the manuscript.

DOI: https://doi.org/10.5334/jcaa.201 | Journal eISSN: 2514-8362
Language: English
Submitted on: Jan 23, 2025
Accepted on: Oct 14, 2025
Published on: Dec 12, 2025
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2025 Mathias Bellat, Jordy Didier Orellana Figueroa, Jonathan Scott Reeves, Ruhollah Taghizadeh-Mehrjardi, Claudio Tennie, Thomas Scholten, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.