
Archaeological Classification of Small Datasets Using Meta- and Transfer Learning Methods: A Case Study on Hittite Stele Fragments

Open Access | Jan 2026

1. Introduction

This paper tackles a fundamental problem in archaeological practice: classifying artifacts from image data with machine learning when documentation quality varies considerably because:

  • only a small number of items are available for building a training/testing dataset;

  • objects are fragmented or degraded.

The significance lies in developing methods that can work effectively within these constraints rather than requiring ideal conditions. The specific task involves classifying Hittite stele fragments according to their original geographical provenance (Alacahöyük, Arslantepe, Karkamış, or Sakçagözü) based purely on visual attributes.

Archaeologists face significant challenges in determining artifact provenance, particularly when deformation and deterioration have obscured crucial diagnostic features. While expert assessment can make inferences about an artifact’s origin even in cases of substantial damage, this process often becomes extraordinarily difficult, if not impossible. Using machine learning (ML) to assist in this domain presents several substantial challenges. Primary among these are the insufficient number of examples available in archaeological databases for ML model training, poor documentation practices, and the heterogeneous, non-standardized structure of excavation archives. Different research and excavation teams have developed disparate documentation standards over time. Furthermore, inconsistencies such as low-quality imagery from earlier periods and non-standardized, incomplete datasets render the classification and interpretation of artifacts by ML exceedingly difficult. This situation threatens the accuracy and integrity of archaeological data, leading to information loss. Machine learning nevertheless offers a potential way to overcome these obstacles, although archaeological datasets typically do not reach the volume and sample size that ML applications require. This study investigates the applicability of advanced ML techniques to cases where definitive provenance information is difficult to establish. Through a simulation approach addressing real-world archaeological problems, we evaluate the effectiveness of ML methods on a Hittite stele dataset characterized by limited examples per class, low-resolution and poor-quality images, and restricted contextual information. Our objective is to establish a proof of concept demonstrating the viability of an ML model capable of operating within such constraints, thereby contributing to a potential solution that could enhance classification and identification processes in archaeology.

Artifact clustering and classification have long been pivotal methods for organizing archaeological materials, serving as a lens through which the material culture of the past is interpreted and connected to broader human social and cultural systems (Krieger 1944). In this way, archaeologists learn to explain new observations based on the experience gained through the observation of “similar” cases in known historical contexts.

The systematic classification of archaeological artifacts based on various attributes is a fundamental yet deceptively complex and time-consuming methodological framework (Doran and Hodson 1975; Dunnell 1986; Djindjian 2015). It is based on measuring the “similarity” or amount of resemblance between pairs of objects. In general, two entities are similar if they have many properties in common, such as shape/form, visual appearance, or material composition. The underlying assumption seems to be: what seems similar in shape/form, visual appearance, and/or material composition should have been ‘made’ in the same way, by the same people at the same moment (Fahlander 2008). Numerical taxonomy tools have traditionally been used in archaeology for calculating pairs of “similar” enough artifacts and building classes that may act as references for ideal explanatory concepts (Clarke 1968; Doran and Hodson 1975; Read 2007; MacLeod 2018; Barceló & Bogdanovic 2015; Barceló et al. 2022). Intrinsically quantitative features like shape/form, size, and material composition are therefore the basis of archaeological statistical classifications. Problems begin when we have to classify visual attributes, such as symbols and iconographical representations (Shelley 1996; James 2015; Robinson 2020).

The traditional practice of visual-based archaeological classification has historically been a domain characterized by subjective interpretations and ad hoc decisions (Adams & Adams 1991). Artificial intelligence (AI) and machine learning (ML) methods present an opportunity to introduce a more rational analytical approach to classifying image data, owing to their mathematical capacity to detect patterns and avoid the usual simplification of visual inputs into binary data (presence/absence of a particular visual element). The goal is to design and implement a computer model trained using known images that is then able to predict the output (the functional, chronological, or provenience “class” or “type”) when a new image is presented, even if the input is a fragment and lacks some of the visual features that have contributed to defining that “type” (Barceló 2009; Benhabiles and Tabia 2016; Engel et al. 2019; Chetouani et al. 2020; Navarro et al. 2021; Zidane et al. 2022; Ruschioni et al. 2023; Ling et al. 2024).

Artificial intelligence (AI) is a field that aims to transfer human-specific capabilities such as perception, learning, reasoning, interpretation, and prediction to automated machines through the use of software, hardware, and technology within computer science. Machine learning (ML) refers to decision-making processes based on predictions that emerge from applying quantitative methods such as regression and correlation. ML algorithms can learn effectively from images using different approaches: “supervised learning”, which involves learning and making predictions from previously processed (labelled) data; “unsupervised learning”, in which the machine discovers structure by itself in unlabelled data with minimal user input; and “reinforcement learning”, which occurs without any dataset and involves models that improve themselves using feedback and experience, typically in a simulated environment (Luger 2004; Bishop 2006; Konar 2006; Russell & Norvig 2012; LeCun, Bengio & Hinton 2015; Schmidhuber 2015; Bengio, Goodfellow & Courville 2017; Burkov 2019; Alpaydin 2021; Chollet 2021; Géron 2022).

Artificial neural networks (ANNs) are computational models designed to recognize patterns and solve complex problems. The perceptron, introduced by Frank Rosenblatt in 1958, is the simplest form of an ANN, functioning as a binary classifier using a linear function. However, its inability to solve non-linear problems led to the development of multilayer perceptrons (MLPs), which consist of multiple layers, including hidden layers, and use non-linear activation functions to learn complex patterns through backpropagation. Computer scientists usually extend this concept by stacking many such layers, leading towards a form of “deep” learning (DL) and enabling the automatic extraction of hierarchical features from data. This approach has revolutionized fields like computer vision and natural language processing, driven by large datasets, powerful computational resources, and advanced training algorithms (LeCun, Bengio & Hinton 2015). Transfer learning allows models trained on one task to be adapted for another, reducing the necessity for large datasets and training time (Bengio, Goodfellow & Courville 2017; Burkov 2019; Chollet 2021; Géron 2022). Meta-learning, or “learning to learn,” focuses on improving the learning process itself, enabling models to quickly adapt to new tasks with minimal data. It includes few-shot learning (FSL) and is a part of Auto-ML methodologies (Schmidhuber 1987; Thrun & Pratt 1998; Vanschoren 2018; Hospedales et al. 2021).

ANNs hold significant potential for classifying archaeological artifacts by processing not only discrete visual features but also their spatial relationships, thereby uncovering non-linear patterns and trends in partially overlapping classes of artifacts. However, applying ANNs in archaeology presents substantial challenges, particularly in achieving optimal generalization. Balancing the complexity of the neural network model (the number of parameters) against the input-output relationships it must learn is critical. Insufficient parameters in a simple input-output function may result in “underfitting”, where the model fails to capture significant patterns and nuances within the data, thereby limiting its utility for archaeological interpretation. Conversely, increasing the number of input-output pairs and model parameters increases the risk of “overfitting”, where the model learns noise or irrelevant patterns in the training data, achieving high accuracy on training sets but performing poorly on validation or unseen datasets. Overfitting not only degrades generalization but also undermines the interpretability of results, a crucial concern in archaeology where the goal is to derive meaningful insights from limited datasets (Pothuganti 2018). Furthermore, we should take into account the bias-variance trade-off, which is just as important as optimization and generalization in the definition of overfitting and underfitting. Here, bias refers to the error introduced by an excessively simplified model, which can cause the model to miss relevant relations between features and target outputs (underfitting). Variance refers to the error introduced by the model’s sensitivity to small fluctuations in the training dataset. High variance can cause the model to capture noise in the training data as if it were a genuine pattern (overfitting) (Belkin et al. 2019).

Addressing these issues requires navigating the dual challenge of “optimization” and “generalization”, core to all machine learning applications. Optimization involves tuning the model to minimize errors on training data, while generalization assesses its performance on unseen data. Initially, both training and validation losses decrease as the model learns relevant patterns, but beyond a certain point, overfitting emerges, evidenced by degradation in validation performance. Mitigating this requires either augmenting training data or employing techniques to constrain model capacity, such as regularization, dropout, or simplified architectures, to focus the model on significant patterns (Wong et al. 2016). Ultimately, the success of ANNs in archaeology depends on aligning the complexity of the model with the scope and variability of data available, minimizing training error while ensuring the model generalizes effectively. Such careful calibration enables the predictive modeling of historical provenance, typologies, and other essential archaeological inferences, advancing the discipline through innovative AI methodologies (Barceló 1995; Barceló 2009; Deravignone, Blankholm & Pizziolo 2015; Sharafi et al. 2016; Barone et al. 2019; Trier, Cowley & Waldeland 2019; Mantovan & Nanni 2020; Gualandi, Gattiglia & Anichini 2021; Guyot, Lennon & Hubert-Moy 2021; Barceló et al. 2022; Jamil et al. 2022; Kadhim & Abed 2023; Ling et al. 2024; Yalov-Handzel, Cohen & Aperstein 2024).
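
As a minimal sketch of these capacity-control techniques, dropout and an L2 penalty (weight decay) can be attached to a small PyTorch classifier as follows; the layer sizes and rates here are illustrative assumptions, not the configuration used in this study:

```python
import torch
import torch.nn as nn

# Minimal sketch (not this study's architecture): a small multilayer
# perceptron whose capacity is constrained by dropout and by L2
# regularization, applied through the optimizer's weight_decay term.
model = nn.Sequential(
    nn.Flatten(),                   # image tensor -> one feature vector
    nn.Linear(224 * 224 * 3, 128),  # hidden layer
    nn.ReLU(),                      # non-linear activation
    nn.Dropout(0.5),                # randomly zero half the activations during training
    nn.Linear(128, 4),              # one output per candidate provenance class
)

# weight_decay adds an L2 penalty on the weights during optimization.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-3)
```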

The possibilities offered by modern machine learning (ML) to archaeological classification issues are challenged by the special nature of archaeological datasets, limited in size and diversity by the usually random nature of archaeological findings. Limited training data can lead to overfitting, where models perform well on training samples but poorly on unseen data. This results in reduced generalization and accuracy. Insufficient test data may not adequately represent the true distribution, leading to unreliable performance estimates. Small datasets also increase sensitivity to noise and outliers, making it difficult to distinguish meaningful patterns from random fluctuations (Brigato & Iocchi 2021).

This is the trouble in archaeological applications: when there is not enough diverse data, the model may learn very specific features that do not generalize well to other data. Archaeologists depend on what they have found, and the amount of well-contextualized findings is, in many cases, not enough for properly training ANNs. Our challenge is then to design an ML approach optimized for learning when the training dataset appears to be small, and preserved variation in the sample is unrelated to the original variation in the ancient population.

In recent years, important papers have evaluated the fundamental problems and offered general accounts of how the integration of machine learning (ML) and deep learning (DL) is transforming archaeological research by enabling automated classification, pattern recognition, and reconstruction of material culture, addressing longstanding challenges in data volume, interpretive subjectivity, and interdisciplinary synthesis (Bickler 2021; Barceló et al. 2022; Bellat et al. 2025; Gattiglia 2025). Some supplementary examples from the discipline are mentioned below.

In pottery studies, the ArchAIDE project developed a mobile and desktop application that uses image recognition and machine learning to automate the classification of archaeological ceramics (Gattiglia & Wright 2018). Ensemble DL models combining VGG-16, Inception-v3, and GoogLeNet architectures have achieved over 95% accuracy in classifying ceramic provenance from microscopic images, enhancing non-destructive analysis of kiln-specific features (Wang, Xiao & Liu 2024). The FabricAI project employs CNNs trained on stereomicroscopic images of Roman coarse wares to automate fabric recognition with up to 99.1% accuracy, streamlining post-excavation workflows and GIS integration for trade pattern reconstruction (Willems, Chaidron & Borgers 2024).

Lithic analysis benefits from Faster R-CNN ResNet-50 models distinguishing anthropogenically modified artifacts from natural clasts with 100% concordance to expert classifications across diverse global datasets (Emmitt et al. 2022). In a similar study, Generative Adversarial Networks (GANs) for 3D core simulations predict spurious flake removals with high fidelity, facilitating reduction sequence reconstructions (Orellana Figueroa et al. 2021). One study compares K-Means, hierarchical clustering, and Self-Organizing Maps to classify lithic artifacts from Pre-Pottery Neolithic B sites in the Southern Levant (Troiano et al. 2024). A further study compares Bayesian regularization and Levenberg-Marquardt training algorithms for predicting missing metrics in fragmented Neolithic laminar artifacts using neural networks, demonstrating how tailored architectures and normalization strategies enhance performance on small archaeological datasets and support accurate reconstruction of blade dimensions (Troiano et al. 2024). A final study introduces the LUWA dataset, the first large-scale open-source collection of microscopic images for lithic use-wear analysis, enabling the classification of worked materials like bone, wood, and antler (Zhang et al. 2024).

For human bone and faunal bioarchaeology, ResNet-50 and Inception-v3 transfer learning models classify bone surface modifications with 96.3% accuracy, revealing Neanderthal use of hyena pelts through cut mark differentiation (Moclán et al. 2024), whereas critical evaluations of DL for taphonomic equifinality highlight dataset quality limitations, advocating for balanced, high-resolution imagery to mitigate overfitting in BSM classification (Courtenay et al. 2024). Another study introduces a novel approach to identifying early dog domestication by analyzing tooth marks using geometric morphometrics and machine learning. It demonstrates that bite traces from wolves and domestic dogs can be reliably distinguished, even in the absence of skeletal remains. The findings offer an indirect yet robust method for detecting domesticated canids in Palaeolithic archaeological contexts (Yravedra et al. 2019).

Rock art research leverages transfer learning with VGG19 and EfficientNet V2 S on 3,100 labeled images for motif classification in Indigenous Australian contexts, achieving 79.76% top-1 accuracy while fostering educational citizen science (Turner-Jones et al. 2024), and unsupervised clustering via Gaussian Mixture Models and Self-Organizing Maps on oculated idol patterns uncovers regional stylistic networks in Copper Age Iberia (Jiménez-Puerto 2024).

Settlement and landscape archaeology, using remote sensing and LiDAR, provides the most numerous examples of ML/DL methods in archaeology. One LiDAR case study applies deep convolutional neural networks with transfer learning to LiDAR-derived terrain data for semi-automatic detection and segmentation of archaeological structures in Brittany, France; the approach demonstrates high accuracy in identifying and characterizing diverse topographic anomalies, offering scalable solutions for large-area archaeological prospection (Guyot, Lennon & Hubert-Moy 2021). Another study employs transfer learning on satellite imagery for Mesopotamian site detection with 80% accuracy, proposing hybrid human-AI workflows for validation (Casini et al. 2023). In parallel, CNNs applied to Bulgarian satellite data for burial mound identification revealed high false-positive rates (87–95%) due to landscape heterogeneity, underscoring the need for robust negative training data (Sobotkova et al. 2024). Broader applications encompass DL pipelines detecting Nasca geoglyphs from aerial photos with 0.9 confidence thresholds, uncovering four novel motifs including humanoids and birds (Sakai et al. 2023). Another important study is one of the pioneers in applying fuzzy logic and neural networks to spatial data from Māori pa sites, enabling nuanced classification and pattern recognition beyond conventional statistical methods; the hybrid approach reveals subtle cognitive and cultural factors in site construction, offering a refined model for archaeological interpretation (Reeler 1999). A further study demonstrates the effectiveness of convolutional neural networks (CNNs) in detecting Early Iron Age “Saka” burial mounds across the Eurasian steppe using open-source satellite imagery; the CNN model outperforms traditional detection methods, enabling scalable, non-invasive archaeological site identification in politically and logistically challenging regions (Caspari & Crespo 2019).

Textual and epigraphic applications include VGG16-based systems for cuneiform symbol recognition on Hammurabi Code tablets, leveraging data augmentation for 90% detection accuracy in wedge-shaped inscriptions (Elshehaby et al. 2024), and Mask R-CNN within Detectron2 for segmenting Egyptian hieroglyphs on degraded papyri and stone, addressing stylistic variability with promising hyperparameter-tuned results (Guidi et al. 2023).

Numismatics advances through CoinNet, a DL model classifying Roman Republican coins via reverse motifs on deformed surfaces (Anwar et al. 2021), and Graph Transduction Games for semi-supervised coin type recognition, enabling accurate attribution with minimal labeled data (Aslan, Vascon & Pelillo 2020).

Generative AI and large language models (LLMs) facilitate ChatGPT-guided development of anomaly detection apps from aerial imagery, democratizing computational tools for non-programmers (Ciccone 2024), while NLP models like GPT-3 and BERT reconstruct fragmented ancient scripts such as Mayan glyphs, navigating data scarcity through pattern hypothesis generation (Koc 2025). Another example is a generative AI framework reconstructing Bronze Age vessels from 4,000+ fragments via regressor-reconstructor-denoiser models, validated at high fidelity by experts (Cardarelli 2024).

These innovations collectively underscore ML/DL’s transformative role in archaeology, bridging quantitative precision with interpretive depth, though persistent challenges like dataset imbalances and ethical biases necessitate interdisciplinary safeguards to ensure equitable heritage representation.

This study investigates the potential of artificial neural network (ANN)-based methods to overcome the constraints posed by limited training and testing datasets, with a specific focus on the classification of Hittite stele fragments. Hybrid methodologies incorporating model-agnostic meta-learning (MAML), few-shot learning (FSL), and transfer learning via a pre-trained ResNet18 architecture have been compared to classical ML algorithms. These techniques have demonstrated efficacy in training ML models on small datasets in other domains. This paper rigorously evaluates the performance of these approaches in addressing key challenges in artifact classification, including label scarcity, the impact of degradation and fragmentation on the data (analogous to noise in ML literature), and inconsistencies in documentation quality. Moreover, the study integrates domain-specific expertise from a Hittite art specialist to assess the practical implications of ML methodologies in artifact classification tasks. Framed in ML terminology as an “object localization” problem, our purpose is to learn the provenance of fragments of stelae from four different Hittite cities and predict the provenience of unclassified new items based on discovered differences and similarities in visual features. The results demonstrate that robust classification can be achieved with a minimal dataset of 136 labeled samples, highlighting the superiority of the transfer (pre-trained) model and the proposed hybrid model over conventional methods, as well as the advantages of simple FSL approaches addressing multiclass classification problems within archaeology.

The paper is structured as follows: Chapter 1 provides an overview of the research objectives and a summary of the state of the art; Chapter 2 describes the materials and gives an overview of the methodologies employed; Chapter 3 presents the analytical results; and Chapter 4 discusses the findings and their broader implications for archaeological research. In Appendix 1, the reader will find an exhaustive mathematical background for the methods used. The code for the programs used in the paper can be accessed at https://github.com/dkayikci/ML-Arch.

2. Materials and Methods

2.1 Data Selection

The Hittite Empire was a major Bronze Age power centered in Anatolia (modern-day Turkey) from approximately 1600–1200 BCE. It reached its height under King Suppiluliuma I (14th century BCE), when it rivaled Egypt for control over the Near East. The Hittites were pioneers in iron processing and chariot warfare, giving them significant military advantages. Their capital was Hattusa (near modern Boğazkale, Turkey), featuring impressive architectural achievements including massive stone walls and gates (Bryce 2002; Genz & Mielke 2011; Gilibert 2011; Van den Hout 2020).

Among the material culture, we focus here on the stone relief stelae (Hawkins 2014; Hundley 2014; Cammarosano 2015). These stelae, originating from ancient Hittite centers such as Alacahöyük, Arslantepe, Karkamış, and Sakçagözü (Figure 1), offer invaluable insights into the Hittite cultural and historical milieu, depicting military triumphs, religious ceremonies, and daily life. Despite their significance, many stelae remain unidentified due to fragmentation and uncertain provenance, often resulting from illicit trade or the lack of information in private collections. Our study addresses the challenge of classifying these artifacts using visual attributes not biased by any verbal description or prior identification and employing advanced machine learning techniques. The goal is to advance the prediction of the original provenience of fragmented stelae, based on the visual information we may learn and generalize from well-preserved items, whose archaeological context and provenience are well known. The direct objective of the present research is to determine to which of the four candidate cities the examples in the test dataset (specifically, the side with relief art) belong, based on stylistic differences and stone surface textures using gradient optimization-based models. Stelae from regions like Karkamış and Sakçagözü in southern Anatolia and northern Syria show a blend of Hittite and local styles, incorporating motifs and techniques from neighboring cultures. If we can distinguish these elements, it would be possible to infer the provenience of a fragment preserved in a museum or a private collection in the absence of contextual information for its origin.

Figure 1. Hittite cities in the scope of the study.

To achieve these objectives, we have compared the performance of diverse machine learning methods, specifically meta-learning and transfer learning models, as well as hybrids of them, which have proven to be successful even with limited datasets. Additionally, a Hittite art expert conducted a classification to provide insights into the differences between AI and human expertise.

The image dataset utilized in this investigation was acquired by the first author (DK) under controlled conditions within the Ankara Anatolian Civilizations Museum’s Hittite Reliefs Exhibition Hall, where all photographic documentation was executed with consistent equipment parameters (identical camera model, uniform settings, and standardized ambient illumination). We detected minor variations in illumination intensity across the exhibition space and subtle angular deviations during photographic documentation. No effort was made to eliminate these variations, because dealing with noise and spurious variation is central to the research question underpinning our methodological approach.

For dataset preparation, we followed this procedure:

  • Data Collection: The raw dataset comprised 298 JPEG images with a resolution of 4928 × 3264 pixels of stelae from the four previously mentioned Hittite centers (Alacahöyük, Arslantepe, Karkamış, and Sakçagözü).

  • Data Cleaning: Duplicate images and those unrelated to the thematic focus were removed, resulting in a refined dataset of 88 images organized into separate folders based on the geographical location of the original stelae.

  • Image Preprocessing: The RGB color channel information was preserved throughout our computational processing pipeline to maintain rich visual information requisite for nuanced stylistic classification. The 88 images were resized to 1500 × 1000 pixels. Some regions of these images were cropped to generate 412 images simulating “fragmented stelae”. Cropped patches were chosen deliberately, emphasizing those parts of the stelae particularly hard to discern due to degradation and alteration (Figure 2). This approach aims to simulate real-world scenarios where archaeologists face challenges in classifying highly deformed fragments that cannot be easily identified.

  • Sample Selection: Of the 412 samples, we processed only 136 to avoid differences in sample size across proveniences. Only 34 images are available from the Sakçagözü site, so our dataset contains the same number of images (34) per site.

  • Training/Testing Split and Final Decision: We ended up with 28 training (support) and 6 testing (query) samples in each of the 4 classes, totaling 112 samples for training and 24 samples for testing (approximately 82% to 18%). Therefore, after all these steps, the initial dataset of this study consisted of 136 samples. The small number of testing cases has been compensated for using an explicit evaluation procedure (as discussed further in the paper).

Figure 2. Samples of cropped image regions.

Cropped images were standardized to a resolution of 224 × 224 pixels (Figure 3). This resolution balances the risks of overfitting, which can occur at higher resolutions, and the loss of critical feature information at lower resolutions. This level of image preprocessing is a fundamental step in computer vision pipelines, ensuring raw image data is properly formatted for deep learning architectures. Standardization of pixel values to a consistent range, resizing, and cropping have been considered essential, particularly for constrained datasets like ours, to optimize computational efficiency and model performance (Bengio, Goodfellow & Courville 2017). Spatial resizing supports mini-batch learning, mitigates memory limitations, and accelerates training and inference by reducing computational overhead. Trade-offs between resolution and batch size directly impact recognition accuracy, as recent advancements in recognition-aware preprocessing demonstrate. However, the goal of such operations is not perceptual quality but rather the accurate performance of the recognition model (Talebi & Milanfar 2018).
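
For illustration, a preprocessing pipeline of this kind can be written with torchvision transforms; the normalization statistics below are the standard ImageNet values and are our assumption rather than a parameter reported in this paper:

```python
from PIL import Image
from torchvision import transforms

# Hypothetical sketch of the preprocessing described above: resize cropped
# fragments to 224 x 224 and standardize pixel values. The normalization
# statistics are the usual ImageNet values (an assumption here).
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),   # input size expected by ResNet18
    transforms.ToTensor(),           # HWC uint8 in [0, 255] -> CHW float in [0, 1]
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

fragment = Image.open("fragment.jpg").convert("RGB")   # placeholder filename
x = preprocess(fragment).unsqueeze(0)                  # shape: (1, 3, 224, 224)
```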

Figure 3. Data transformation pipeline.

This initial dataset has been further divided into training and testing subsets for each Hittite city, with 28 training (support) and 6 testing (query) samples per provenience set, maintaining an approximate 82% to 18% split (Figure 4). We are conscious that this may seem very small for testing the predictive capabilities of a machine learning model; however, we could not obtain more viable specimens. We have prioritized a balanced dataset with 6 test samples per city, directly addressing the conspicuous lack of stelae from Sakçagözü in the Ankara Anatolian Civilizations Museum’s Hittite Reliefs Exhibition Hall. Small test set size is not a methodological oversight, but a deliberate choice to simulate authentic archaeological conditions. No specific strategy for data augmentation has been considered here, given the implicit problems it may cause. Artificially increasing the size of the dataset through augmentation can give a false sense of security. While the dataset appears larger, it may not actually contain more diverse or informative examples. This can lead to overconfidence in the model’s performance (Wong et al. 2016).

Figure 4. Dataset: training data on the left (28 samples per class) and testing data on the right (6 samples per class).

2.2 Enhanced Dataset Implementation

To investigate the impact of increased data volume on model performance, an expanded version of the original dataset was created. The enhanced dataset featured a substantial increase in both training and testing instances: evolving from the original 136 instances (112 for training, 24 for testing; an 82.35%–17.65% split) to 208 total instances (152 for training, 56 for testing; a 73.08%–26.92% split). This expansion more than doubled the test set size, increasing it from 24 to 56 instances. A balanced distribution across the four archaeological classes was maintained, with the enhanced dataset providing 38 training samples per class (compared to 28 in the initial dataset) and 14 testing samples per class (compared to 6 previously).

The samples used for this expansion were selected from a reserve collection of digital photographs featuring additional stelae from the Ankara Museum of Anatolian Civilizations. These images were captured using identical equipment and under standardized conditions, consistent with the original dataset. All new samples underwent the same preprocessing, transformation, and cropping procedures as outlined in the methodology. Furthermore, model training parameters remained identical to those employed with the original dataset. This methodological consistency ensures direct comparability of results and allows for the isolation of data volume effects on classification performance. The considerably larger test set provides a more robust foundation for evaluating model generalization.

It is crucial to differentiate this dataset enhancement from traditional data augmentation techniques common in machine learning. The expansion was achieved by incorporating entirely novel, genuine samples from the reserve collection, rather than by applying synthetic manipulations (such as rotation or grayscale conversion) to existing images. The scope of this expansion was ultimately determined by the limited availability of artifacts from the Sakçagözü site, which dictated an upper boundary of 14 test samples per class. This constraint resulted in a balanced test dataset of 56 unique instances (14 samples × 4 classes), composed of authentic archaeological artifacts.

The dataset was expanded by 18 new samples per class, totaling 72 additional examples across the four classes. The allocation of these new instances was systematic: 10 examples per class (40 total) were added to the training subset, and 8 examples per class (32 total) were added to the testing subset. This differential proportional allocation, leading to the revised 73.08%–26.92% training–testing split, was a deliberate methodological choice. The primary aim was to significantly enlarge the test set, thereby providing a more rigorous framework for assessing the generalization capabilities of the models. The final dataset dimensions of 152 training samples and 56 testing samples represent a considerable volumetric increase. This approach ensured that while the training/testing proportions were adjusted, class balance was maintained, and the expanded dataset offered substantially improved conditions for robust model evaluation and the assessment of generalization.

2.3 Methodological Overview

We have compared the results of four different supervised learning approaches for predicting the provenience of our small set of Hittite stelae:

  • Classical Machine Learning Algorithms: Traditional approaches, including Support Vector Machines (SVM), K-Nearest Neighbors (KNN), Decision Trees (DT), Random Forests (RF), Logistic Regression (LR), and Naive Bayes (NB).

  • Transfer learning: A pre-trained deep learning model, ResNet18, was utilized to leverage transfer learning for this task.

  • Simple CNN with FSL: A straightforward convolutional neural network (CNN) architecture was employed in conjunction with FSL. As will be argued later, without the integration of FSL, the simple CNN architecture underperformed, failing to generalize and resulting in underfitting. Therefore, cross-validation has not been implemented for this model.

  • Hybrid Approach: A combination of the ResNet18 pre-trained network architecture and a gradient-based meta-learning algorithm (model-agnostic meta-learning, MAML) was applied, incorporating few-shot learning (FSL) techniques.

The performance of these four approaches was evaluated against the classification accuracy rate achieved by a human expert as a reference value, providing a benchmark for comparison and highlighting the relative efficacy of advanced ML methodologies in this context.

Convolutional neural networks (CNNs) are designed specifically to preserve spatial autocorrelation through convolutional filters that detect patterns across neighboring pixels. They allow hierarchical feature extraction, which captures both low-level textures and high-level shapes. Furthermore, they offer translation invariance that allows them to recognize features regardless of their position. This makes CNNs dramatically more effective for distinguishing the intricate carved details, weathering patterns, and stylistic elements that characterize different periods and regions of Hittite stone monuments.
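
The contrast with flattened pixel vectors can be made concrete with a miniature convolutional stack; the layer sizes below are illustrative assumptions, not the custom CNN evaluated later in this paper:

```python
import torch.nn as nn

# Illustrative miniature CNN (layer sizes are assumptions). Convolutions
# learn filters over pixel neighborhoods, preserving spatial structure;
# pooling adds tolerance to small translations of carved motifs.
simple_cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # low-level edges and textures
    nn.ReLU(),
    nn.MaxPool2d(2),                              # 224 x 224 -> 112 x 112
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # higher-level carved motifs
    nn.ReLU(),
    nn.MaxPool2d(2),                              # 112 x 112 -> 56 x 56
    nn.Flatten(),
    nn.Dropout(0.5),
    nn.Linear(32 * 56 * 56, 4),                   # four provenance classes
)
```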

Transfer learning (pretrained models) is an approach where knowledge gained from training a model on one task is applied to improve learning on a related task. A CNN is “pre-trained” on a large dataset (such as ImageNet; Deng et al. 2009) with millions of diverse images across thousands of categories. This teaches the network to recognize fundamental visual features like edges, textures, shapes, and complex patterns. The pre-trained network is then repurposed for a new, specialized task (such as Hittite stelae classification) by keeping the early and middle convolutional layers (which capture universal visual features) and replacing and retraining the final layers to recognize features relevant to the new task (Weiss, Khoshgoftaar & Wang 2016; Zhuang et al. 2020). Fine-tuning is a specific form of transfer learning in which some or all of the pre-trained layers are unfrozen and updated during training on the new dataset, allowing the model to learn more task-specific features while still benefiting from the pre-trained weights (Schmidhuber 1987; Pan & Yang 2009; Vanschoren 2018; Hospedales et al. 2021).

ResNet18 is a CNN architecture with 18 layers that has been trained on over a million images from the ImageNet database (Stanford Vision Lab 2020). It can classify images into 1000 object categories and has learned rich feature representations for a wide range of images. It is an example of a residual neural network, in which the layers learn residual functions with reference to the layer inputs (He et al. 2016). In this paper, we started with the pre-trained ResNet18 model; we removed the final classification layer, substituted it with a new four-unit output layer for our Hittite stelae classification task (one unit for each of the four candidate Hittite centers: Alacahöyük, Arslantepe, Karkamış, and Sakçagözü), and fine-tuned the model with our image dataset from the Ankara Museum. We opted for the ResNet18 deep learning architecture because of its computational efficiency, attributable to its relatively low parameter count. Although the specific content of ancient engraved stelae may differ from the natural images in ImageNet, the pre-trained model can still provide a strong starting point. The model can be fine-tuned to adapt to the specific characteristics of the stelae images, such as the depiction of human figures holding objects. Given the limited size of our dataset, ResNet18 offers an optimal balance between model complexity and adaptability, facilitating effective learning while minimizing computational overhead (Liu et al. 2023).
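
A sketch of this setup, following the hyperparameters reported below in Table 2 (frozen layer1 and layer2, Adam with a learning rate of 0.001 and weight decay of 0.001), might look as follows; the exact composition of the replacement head is our assumption:

```python
import torch
import torch.nn as nn
from torchvision import models

# Sketch of the transfer-learning setup (head composition assumed).
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# Freeze the early residual stages, which encode generic visual features
# such as edges and textures (Table 2: frozen layer1, layer2).
for stage in (model.layer1, model.layer2):
    for param in stage.parameters():
        param.requires_grad = False

# Replace the 1000-class ImageNet head with a 4-class head (Alacahöyük,
# Arslantepe, Karkamış, Sakçagözü). CrossEntropyLoss applies softmax
# internally; class probabilities are read off with softmax at inference.
model.fc = nn.Sequential(nn.Dropout(0.5), nn.Linear(model.fc.in_features, 4))

optimizer = torch.optim.Adam((p for p in model.parameters() if p.requires_grad),
                             lr=1e-3, weight_decay=1e-3)
criterion = nn.CrossEntropyLoss()
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer,
                                                       factor=0.5, patience=5)
```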

For ResNet18 processing, we used standard convolutional filtering. For the other models, to reduce the impact of the small sample size and simplify the models, we used the scikit-learn library (Pedregosa et al. 2011) to flatten multi-dimensional input tensors: image data of height (h), width (w), and channels (c) are transformed into one-dimensional vectors of length h × w × c.
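
A minimal sketch of this flattening step, with random arrays standing in for the real image tensors and an SVM configured as in Table 2, could be:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Placeholders for the real image tensors (sizes follow the dataset split).
rng = np.random.default_rng(42)
train_images = rng.random((112, 224, 224))   # 112 training crops
test_images = rng.random((24, 224, 224))     # 24 test crops
y_train = np.repeat([0, 1, 2, 3], 28)        # 28 samples per provenance class
y_test = np.repeat([0, 1, 2, 3], 6)          # 6 samples per provenance class

# Each h x w (x c) image becomes a single vector of length h * w (* c).
X_train = train_images.reshape(len(train_images), -1)
X_test = test_images.reshape(len(test_images), -1)

clf = SVC(kernel="rbf", random_state=42)     # SVM settings as in Table 2
clf.fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```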

Meta-learning, also known as “learning to learn,” is a subfield of machine learning where an algorithm gains experience over multiple learning episodes, using this experience to improve its learning performance over time. It is particularly useful when dealing with extremely limited datasets, as it allows the model to generalize from a few examples by leveraging knowledge acquired from previous tasks. In the meta-learning literature, the terms “training dataset” and “test dataset” are conceptually referred to as the “support dataset” and “query dataset”, respectively. Few-shot learning (FSL) is a special case of meta-learning designed to train models that can recognize new patterns or classes after seeing only a few examples, rather than requiring thousands of labeled samples. Beginning with a pre-trained CNN (transfer learning) to start the classification task with a strong foundation of general features and patterns, FSL allows a model to adapt its knowledge to a new domain with minimal examples (typically 1–5 per class). This is where the “few-shot” name comes from (Schmidhuber 1987; Naik & Mammone 1992; Thrun & Pratt 1998; Ravi & Larochelle 2017; Wang et al. 2020; Hospedales et al. 2021; Song et al. 2023).

Many few-shot techniques work by learning a similarity metric or embedding space where similar items cluster together. New examples are classified based on their proximity to the few support examples in this space. Techniques like model-agnostic meta-learning (MAML) train the model to be easily adaptable to new tasks with minimal gradient updates, essentially “learning how to learn” from a few examples. Its main goal is to find an optimal set of initial parameters that may serve as a strong starting point for quick adaptation across a wide range of tasks. During training, MAML uses a nested optimization procedure with two levels: a) In the inner loop, the model temporarily adapts its parameters to specific tasks using just a few examples (the “support set”) and a small number of gradient steps; b) in the outer loop, these adapted parameters are evaluated on new examples from the same tasks (the “query set”), and the loss is used to update the original initialization parameters. By repeating this process across many different tasks, MAML learns initial parameters that are positioned in parameter space such that small changes can produce good task-specific models. This makes the model inherently flexible and quick to adapt to previously unseen tasks during testing, requiring only a handful of examples to achieve good performance (Finn, Abbeel & Levine 2017; Vanschoren 2018; Hospedales et al. 2021).
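
The nested loops can be sketched with the `higher` library (listed in Table 1); the task sampler and model here are placeholders, while the inner learning rate (0.01) and three inner updates echo Table 2. Everything else is illustrative, not the paper's exact implementation:

```python
import torch
import torch.nn as nn
import higher

# Minimal sketch of MAML's nested optimization using `higher`.
def maml_outer_step(model, meta_opt, task_batch, inner_lr=0.01, inner_steps=3):
    criterion = nn.CrossEntropyLoss()
    meta_opt.zero_grad()
    for support_x, support_y, query_x, query_y in task_batch:
        inner_opt = torch.optim.SGD(model.parameters(), lr=inner_lr)
        with higher.innerloop_ctx(model, inner_opt,
                                  copy_initial_weights=False) as (fmodel, diffopt):
            # Inner loop: temporarily adapt to the task on its support set.
            for _ in range(inner_steps):
                diffopt.step(criterion(fmodel(support_x), support_y))
            # Outer loop: evaluate the adapted parameters on the query set and
            # backpropagate through the adaptation to the shared initialization.
            criterion(fmodel(query_x), query_y).backward()
    meta_opt.step()  # update the initialization (e.g., Adam with lr = 0.0005)
```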

MAML is particularly valuable because of its “model-agnostic” nature, which allows for its application to any architecture optimizable through gradient descent, including neural networks for classification, regression, and reinforcement learning. As archaeological datasets typically consist of limited samples, the meta-learning approach offers a more efficient and adaptable solution for artifact identification and classification, bridging the gap between AI capabilities and archaeological necessities. More details appear in Appendix 1, which provides an exhaustive mathematical background for some of these methods.

2.4 Implementation and Parameter Setting

For classical machine learning methods, 2D images of 224 × 224 pixels were flattened into 1D vectors of 50,176 components. This transformation of high-dimensional image data into numerical vectors implies a necessary dimensionality reduction to allow for reasonable processing times (Chychkarov et al. 2021; Géron 2022). Details of the software environment appear in Table 1.

Table 1

Software libraries and programming environment.

| LIBRARY | VERSION |
| --- | --- |
| Python | 3.11 |
| PyTorch | 2.3.0+cu121 |
| Seaborn | 0.13.2 |
| Scikit-learn | 1.2.2 |
| Matplotlib | 3.10.0 |
| Torchvision | 0.18.0+cu121 |
| CUDA | 12.1 |
| Higher | 0.2.1 |
| Evograd | 0.1.2 |
| Easyfsl | 1.5.0 |
| Scipy | 1.15.2 |
| NumPy | 1.25.2 |
| Pandas | 2.2.2 |
| Threadpoolctl | 3.6.0 |
| Joblib | 1.4.2 |
| Google Colaboratory | A100 GPU |

In our CNN models, the output layer is defined by four units corresponding to four categories: Alacahöyük, Arslantepe, Karkamış, and Sakçagözü, the four areas of provenience of the stelae according to museum documentation, as annotated in each image. Each unit outputs a probability via the softmax function, indicating how likely it is that a fragment originates from each city. This setup provides a probabilistic assessment across all classes, enabling accurate classification and aiding in the archaeological interpretation of these ancient artifacts. Table 2 highlights the different hyperparameter configurations used across the four model approaches in the study. It shows the progression from simple conventional machine learning models to more sophisticated neural network architectures with advanced optimization techniques and regularization strategies. For the hybrid and transfer models, largely the same hyperparameters were used (Table 2 lists the differences).

Table 2

Parameters of the models implemented on the initial dataset.

| HYPERPARAMETER | CONVENTIONAL ML MODELS | TRANSFER LEARNING (RESNET18) | HYBRID (MAML+FSL+RESNET18) | SIMPLE CNN + FSL |
| --- | --- | --- | --- | --- |
| *Architecture Specific* | | | | |
| Base architecture | N/A | ResNet18 (pretrained) | ResNet18 (pretrained) | Custom CNN |
| Frozen layers | N/A | layer1, layer2 | layer1, layer2 | None |
| Dropout rates | N/A | 0.5, 0.3 | 0.5, 0.3 | 0.5 |
| *Learning Process* | | | | |
| Batch size | N/A | 16 | 16 | 8 |
| Number of epochs | N/A | 30 | 20 | 100 |
| *Optimization* | | | | |
| Optimizer | N/A | Adam | Adam | SGD |
| Learning rate | N/A | 0.001 | Multi-tier: 0.001 (fc), 0.0001 (layer4), 0.00001 (layer3) | Multi-tier: 0.001 (conv), 0.01 (fc) |
| Weight decay | N/A | 0.001 | 0.001 | 0.01 |
| Momentum | N/A | N/A | N/A | 0.9 |
| *Meta-Learning* | | | | |
| Meta learning rate | N/A | N/A | 0.0005 | N/A |
| Inner learning rate | N/A | N/A | 0.01 | N/A |
| Inner updates | N/A | N/A | 3 | N/A |
| Meta updates | N/A | N/A | 100 | N/A |
| *Regularization* | | | | |
| L2 lambda | N/A | N/A | 0.001 | 0.001 |
| Gradient clip norm | N/A | N/A | 1.0 | 1.0 |
| *Early Stopping* | | | | |
| Patience | N/A | 5 | 5 | 10 |
| *Loss Function* | | | | |
| Function type | Varies by model | CrossEntropyLoss | CrossEntropyLoss with class weights | Focal Loss |
| Focal Loss gamma | N/A | N/A | N/A | 2 |
| *Learning Rate Scheduling* | | | | |
| Scheduler | N/A | ReduceLROnPlateau | ReduceLROnPlateau | ReduceLROnPlateau |
| Schedule factor | N/A | 0.5 | 0.5 | 0.5 |
| Schedule patience | N/A | 5 | 5 | 5 |
| *Model-Specific Parameters* | | | | |
| SVM | kernel='rbf', random_state=42 | N/A | N/A | N/A |
| KNN | n_neighbors=5 | N/A | N/A | N/A |
| Random Forest | n_estimators=100, random_state=42 | N/A | N/A | N/A |
| Logistic Regression | multi_class='ovr', random_state=42, max_iter=1000 | N/A | N/A | N/A |
| Decision Tree | random_state=42 | N/A | N/A | N/A |
| Naive Bayes | Default parameters | N/A | N/A | N/A |
| *Validation* | | | | |
| Cross-validation folds | 5 (3 for enhanced version) | 3 | 3 | N/A |
| Random seed | 42 | 123 | 123 | 123 |

2.5 Validation Procedure

Training relied on a small “support set” of labeled examples for the four classes (112 images from stelae of known origin, 28 images per class, i.e., per Hittite city). For preliminary validation, we used 6 known examples from each of the four proveniences, a total of 24 images. A secondary validation with an enhanced dataset was also carried out, with 14 samples for each class.

An initial procedure involved dividing the 112 training instances into three roughly equal subsets, or folds, each comprising approximately 37–38 samples. The cross-validation then proceeded iteratively: in each of the three iterations, two folds (around 74–75 samples) served as the training data for that specific iteration, while the third, remaining fold (around 37–38 samples) was utilized as the validation set. Key performance metrics such as accuracy, precision, recall, and F1-score were computed on this validation set in each iteration. The average of these metrics over the three folds yielded a more reliable estimate of the model’s generalization capability on data not seen during that iteration’s training. Crucially, the 24 test samples (6 per class) of the first dataset were held out entirely during this cross-validation process and were only engaged after model training and validation were complete, ensuring a final, unbiased assessment of the chosen model’s predictive accuracy on new unseen data (Burkov 2019; Alpaydin 2021; Géron 2022). For the enhanced dataset implementation, the models included 152 training samples and 56 testing samples, totaling 208 samples. In this case, the task size is 4-way 38-shot for support (training) and 4-way 14-shot for query (test).
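
A sketch of this fold structure using scikit-learn is shown below; random feature vectors and a logistic regression stand in for the real images and models, and the 24 held-out test images never enter the loop:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Stand-ins for the 112 training images (small feature dimension for speed).
rng = np.random.default_rng(123)
X = rng.random((112, 64))
y = np.repeat([0, 1, 2, 3], 28)          # 28 samples per city

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=123)
fold_scores = []
for train_idx, val_idx in skf.split(X, y):
    # Two folds (~74-75 samples) train; the third (~37-38 samples) validates.
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    y_pred = clf.predict(X[val_idx])
    fold_scores.append([
        accuracy_score(y[val_idx], y_pred),
        precision_score(y[val_idx], y_pred, average="macro", zero_division=0),
        recall_score(y[val_idx], y_pred, average="macro", zero_division=0),
        f1_score(y[val_idx], y_pred, average="macro", zero_division=0),
    ])
print("mean accuracy, precision, recall, F1:", np.mean(fold_scores, axis=0))
```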

3. Results

3.1 Preliminary Results with the Initial Training and Testing Dataset

The evaluation of machine learning approaches to Hittite stele classification was conducted across two distinct experimental phases to assess both baseline performance and scalability. The initial experimental phase utilized 136 samples (112 training, 24 testing) to establish proof-of-concept effectiveness under severely constrained data conditions. Subsequently, an enhanced dataset phase employed 208 samples (152 training, 56 testing) to evaluate performance improvements with increased data availability. Throughout both phases, human expert classification was conducted by an associate professor specializing in Hittite art, providing authoritative benchmarks for model validation. In the initial dataset phase, the human expert achieved 62.5% accuracy on the 24-sample test set, correctly classifying 15 of 24 fragments. This performance established our baseline reference for evaluating machine learning effectiveness under challenging conditions where even expert human classification faces significant difficulties due to fragment degradation and limited diagnostic features. The enhanced dataset phase revealed substantially improved human expert performance, achieving 85.7% accuracy on the expanded 56-sample test set, correctly classifying 48 of 56 fragments. This dramatic improvement (23.2 percentage points) demonstrates the critical importance of reference material availability for both human and machine-based archaeological classification.

The different methods of classifying Hittite stelae according to their provenance have been evaluated in terms of the following metrics (formal definitions are given after the list):

  • Precision: This measures the accuracy of the positive predictions made by the model. It is the ratio of correctly predicted positive observations to the total predicted positives. High precision indicates that the model has a low false positive rate, meaning that when it predicts a positive outcome, it is likely to be correct.

  • Accuracy: This measures the overall correctness of the model. It is the ratio of correctly predicted observations (both positive and negative) to the total observations. High accuracy indicates that the model is making correct predictions for a large portion of the dataset. However, accuracy can be misleading if the dataset is imbalanced.

  • Recall (Sensitivity): This is the ratio of correctly predicted positive observations to all observations in the actual class.

  • F1 Score: This is the harmonic mean of precision and recall (sensitivity), providing a single metric that balances both concerns. A high F1 score indicates that the model has a good balance between precision and recall.
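
In terms of a class’s true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN), these metrics are defined as follows (the tables below report them averaged over the four classes):

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN},$$

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.$$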

As expected, our results indicate that the classical statistical learning models, although they exhibit very high accuracy and precision (above 0.9) on training data, show very low performance scores (below 0.5) on test data. This is a clear indication of overfitting: such models have fitted the training data too closely, including its noise and peculiarities, rather than learning underlying patterns that generalize to new, unseen data. This is a direct consequence of limiting the comparison to pixel values rather than the spatial organization of those pixels. The reasons for this may be:

  • SVM operates in a high-dimensional feature space but treats each pixel as an independent dimension without considering its spatial relationships.

  • Decision Trees make binary splits on individual feature values, losing the ability to recognize patterns that depend on pixel neighborhoods.

  • Random Forests build multiple decision trees that make splits based on individual pixel values, but none of these decision trees can efficiently capture patterns that span multiple adjacent pixels.

  • Logistic Regression uses linear combinations of individual features, making it unsuitable for capturing complex, non-linear spatial patterns in archaeological imagery.

Transfer and meta-learning methods show far better results in generalization and predictive power, reaching 81–83% correctly classified items in the test stage, albeit on very small test datasets. Although the small size of the testing database may compromise the results, the machine learning results are better than those obtained by a volunteer human expert, an associate professor in Hittite art, who correctly classified the same test items in 62.5% of cases. Precision, recall, and F1-scores are consistently higher than those obtained with the statistical learning methods (Tables 3 and 4). The respective confusion matrices are presented in Appendix 2.

Table 3

Performance comparison of the models, averaged for the four classes in the initial dataset: Sakçagözü, Alacahöyük, Karkamış, Arslantepe.

| MODEL | TRAINING (VAL. %) | TEST % |
| --- | --- | --- |
| Hybrid Model (MAML+FSL+ResNet18) | 73.21 | 81.94 |
| Transfer Learning (only pre-trained ResNet18) | 83.93 | 72.22 |
| Simple CNN + FSL | 58 | 44 |
| Reference: Human Expert Prediction (Assoc. Prof. in Hittite Art) | N/A | 62.50 |

Table 4

Conventional machine learning results, averaged for the four classes: Sakçagözü, Alacahöyük, Karkamış, Arslantepe. Corresponding confusion matrices are included in Appendix 2.

| MODEL | TRAINING ACCURACY | TRAINING PRECISION | TRAINING F1 SCORE | TEST ACCURACY | TEST PRECISION | TEST F1 SCORE | AVERAGE CV SCORE |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Support Vector Machines (SVM) | 0.9167 | 0.9237 | 0.9160 | 0.3929 | 0.4008 | 0.3910 | 0.3433 |
| K-Nearest Neighbors (KNN) | 0.5370 | 0.6401 | 0.5430 | 0.2500 | 0.5094 | 0.2341 | 0.3147 |
| Random Forest (RF) | 1.0000 | 1.0000 | 1.0000 | 0.4286 | 0.4484 | 0.4277 | 0.3152 |
| Logistic Regression (LR) | 1.0000 | 1.0000 | 1.0000 | 0.5000 | 0.5893 | 0.5119 | 0.3970 |
| Decision Tree (DT) | 1.0000 | 1.0000 | 1.0000 | 0.2500 | 0.2989 | 0.2655 | 0.3325 |
| Naive Bayes (NB) | 0.7222 | 0.7616 | 0.7194 | 0.3571 | 0.3662 | 0.3449 | 0.3524 |

Table 4 provides comprehensive performance metrics for traditional machine learning approaches, revealing consistently poor performance with test accuracies uniformly below 50% despite achieving near-perfect training accuracies in several cases. Support Vector Machines attained 39.29% test accuracy, Random Forests reached 42.86%, and Logistic Regression achieved 50.00%—the highest among traditional methods. These results indicate severe overfitting, where models learned training data specificities rather than generalizable patterns, a predictable consequence of high-dimensional feature spaces combined with limited training examples.

The transfer learning model based on ResNet18 achieved 72.22% test accuracy with 83.93% validation accuracy, demonstrating effective adaptation of pretrained features to archaeological classification tasks. While lower than the hybrid approach, this performance still exceeded human expert classification by 9.72 percentage points, validating the effectiveness of transfer learning for archaeological applications even without meta-learning enhancements.

Three-fold cross-validation (as detailed in the Methods section; Figure 5 and Table 5) enabled the comprehensive utilization of the dataset for both training and validation, mitigating the limitations of a single train-test split. The confusion matrices across folds indicated consistent performance, with total elements summing to approximately 34, closely aligning with the expected average values derived from the dataset’s size and folds. The model’s performance metrics averaged across folds demonstrated a high degree of reliability: accuracy (86.9%), precision (79.6%), recall (80.3%), and F1 score (79.9%). The consistent results across folds highlight the model’s stability and generalizability when applied to diverse data splits, underscoring its potential for accurately classifying archaeological artifacts (Bengio, Goodfellow & Courville 2017; Alpaydin 2021).

Figure 5. Performance metrics for each fold for the models implemented on the initial dataset.

Table 5

Summary of metrics across three folds for the models implemented on the initial dataset.

| METRIC | FOLD 1 (%) | FOLD 2 (%) | FOLD 3 (%) | AVERAGE (%) |
| --- | --- | --- | --- | --- |
| Accuracy | 87.5 | 85.2 | 88.1 | 86.9 |
| Precision | 78.6 | 80.3 | 79.8 | 79.6 |
| Recall | 81.3 | 79 | 80.5 | 80.3 |
| F1 Score | 79.9 | 79.6 | 80.1 | 79.9 |

A three-fold cross-validation approach was selected to balance data utilization and evaluation robustness, given the small dataset size. While fewer folds (e.g., 2) may lead to high variance due to larger test sets, more folds (e.g., 4) could result in smaller training sets and less reliable metrics. Three folds allow training the model on approximately 67% of the data, which is enough to learn meaningful patterns while still reserving substantial data for validation. This choice enables meaningful performance assessment while mitigating the limitations of small datasets (Bengio, Goodfellow & Courville 2017; Alpaydin 2021).

Despite the inherent limitations of our small testing dataset, our cross-validation results demonstrate consistency across folds, achieving an average accuracy of 86.9%, higher than the results obtained by the human expert. This high score supports the idea that the optimal three folds achieve an optimal trade-off between bias and variance. This consistent performance across different data splits suggests robust generalization capabilities, indicating the model’s ability to learn underlying patterns even with limited and poor-quality examples. Importantly, the observed misclassification patterns align with established archaeological knowledge regarding artistic similarities between certain cities, providing further evidence that the model is capturing meaningful relationships within the data. While a larger test set might offer some advantages, our current methodology offers a more authentic representation of the challenges inherent in archaeological classification.

Table 6 presents disaggregated results for the hybrid model, revealing class-dependent variations in performance. In training, classification of Sakçagözü fragments demonstrated a strong balance between precision (0.811) and recall (0.748), leading to a robust F1-score of 0.770. However, Alacahöyük fragments exhibited the lowest performance among all categories, with a precision of 0.586 and an F1-score of 0.604, indicating challenges in distinguishing fragments from this geographical provenance. These variations suggest that different regional styles present varying levels of classification difficulty, with some possessing more distinctive visual characteristics than others.

Table 6

Performance metrics of initial dataset implemented hybrid model. Disaggregated results.

METRIC               | SAKÇAGÖZÜ | ALACAHÖYÜK | KARKAMIŞ | ARSLANTEPE | AVERAGE
Training – Precision | 0.811     | 0.586      | 0.861    | 0.779      | 0.759
Training – Recall    | 0.748     | 0.644      | 0.537    | 1.000      | 0.732
Training – F1-Score  | 0.770     | 0.604      | 0.649    | 0.876      | 0.725
Test – Precision     | 0.841     | 0.804      | 0.743    | 1.000      | 0.847
Test – Recall        | 0.778     | 0.833      | 0.778    | 0.889      | 0.820
Test – F1-Score      | 0.797     | 0.797      | 0.755    | 0.933      | 0.821

Table 7 reveals distinct performance patterns across Hittite archaeological sites that illuminate transfer learning strengths and limitations. Sakçagözü demonstrates exceptional performance with perfect test precision (1.0) but reduced recall (0.7778), indicating confident but occasionally incomplete identification. Arslantepe exhibits the most dramatic performance decline, with recall dropping from 0.8630 to 0.50, suggesting significant challenges in recognizing this regional style in unseen examples. Karkamış consistently shows the lowest performance across validation and test phases (F1-scores of 0.7079 and 0.6253), indicating that this production center presents the greatest classification challenges. These differential patterns suggest varying levels of visual distinctiveness among Hittite regional styles, with important implications for archaeological interpretation.

Table 7

Performance metrics of initial dataset implemented transfer learning model. Disaggregated results.

METRIC               | SAKÇAGÖZÜ | ALACAHÖYÜK | KARKAMIŞ | ARSLANTEPE | AVERAGE
Training – Precision | 0.9296    | 0.9259     | 0.7196   | 0.8187     | 0.93
Training – Recall    | 0.8963    | 0.8889     | 0.7148   | 0.8630     | 0.92
Training – F1-Score  | 0.9084    | 0.9063     | 0.7079   | 0.8283     | 0.92
Test – Precision     | 1.0000    | 0.7222     | 0.5983   | 0.8056     | 0.84
Test – Recall        | 0.7778    | 0.8333     | 0.7778   | 0.5000     | 0.83
Test – F1-Score      | 0.8586    | 0.7667     | 0.6253   | 0.5889     | 0.83

Analysis of the simple CNN+FSL classification metrics in Table 8 shows that accuracy and precision vary considerably across classes. For Sakçagözü stelae, the model achieves robust recall (0.93 on training data and 0.92 on test instances) and reasonable precision (0.63 and 0.79), yielding the best F1-scores of this model (0.75 and 0.85). Conversely, Alacahöyük stelae are the most difficult to predict, with clear evidence of overfitting reflected in the precipitous decline in F1-score from 0.50 (training) to 0.16 (test), as is the case with stelae attributed to Karkamış (F1-scores decreasing from 0.55 to 0.40). In this simple CNN+FSL model, even Arslantepe stelae, which are classified and predicted very well by other models, show suboptimal performance (F1-scores of 0.47 and 0.32).

Table 8

Performance metrics of initial dataset implemented simple CNN + FSL. Disaggregated Results.

METRIC               | SAKÇAGÖZÜ | ALACAHÖYÜK | KARKAMIŞ | ARSLANTEPE | AVERAGE
Training – Precision | 0.63      | 0.54       | 0.75     | 0.45       | 0.59
Training – Recall    | 0.93      | 0.46       | 0.43     | 0.50       | 0.58
Training – F1-Score  | 0.75      | 0.50       | 0.55     | 0.47       | 0.56
Test – Precision     | 0.79      | 0.15       | 0.50     | 0.31       | 0.43
Test – Recall        | 0.92      | 0.17       | 0.33     | 0.33       | 0.43
Test – F1-Score      | 0.85      | 0.16       | 0.40     | 0.32       | 0.43

The performance metrics reveal significant challenges in classifying Hittite stelae from the four provenience sites, with clear evidence of overfitting and highly variable performance across different archaeological sites. This preliminary model demonstrates poor generalization, with average test performance (43% F1-score) substantially lower than training performance (56% F1-score). This 13-point drop indicates the model is overfitting to the limited training data rather than learning generalizable features that distinguish stelae from different archaeological sites. Except for the classification of Sakçagözü samples, whose high test recall rate (0.92) suggests the model rarely misses Sakçagözü stelae, the remaining are not well classified, with Alacahöyük as the worst performing classifier. The poor precision (0.15) and recall (0.17) suggest the model struggles to distinguish Alacahöyük stelae from other sites, possibly due to subtle stylistic differences or insufficient training examples. The great variation in performance between sites suggests uneven representation in the training data and that the model is memorizing training-specific features rather than learning generalizable archaeological patterns. With only 6 test images per site (24 total ÷ 4 sites), these metrics have extremely high variance. A single misclassification represents a 16.7% error rate, making the results unreliable. Results improve dramatically when we add transfer learning to the equation (Table 7).

With transfer learning, the average performance is fairly good. The classification of training data is excellent across all sites, with F1-scores ranging from 0.88 to 0.95, indicating the model successfully learns to distinguish between different provenience sites during training. Although the testing set is small, its classification is also very good, with F1-scores above 0.80, showing the model has learned transferable features rather than memorizing training-specific patterns. The classification of samples from Sakçagözü is excellent, with a 0.92 testing F1-score, showing perfect recall (1.00) and strong precision (0.86). This consistency suggests robust feature learning for this site’s distinctive characteristics. Alacahöyük samples seem to be the worst classified, suggesting inherent classification challenges for this site’s stelae. Karkamış data achieves a solid 0.83 F1-score with balanced precision and recall. The classification of Arslantepe samples shows perfect precision (1.00) but lower recall (0.83), indicating the model is conservative in its predictions but accurate when it does classify stelae from this site. These results suggest that low-level visual features learned on ImageNet (edges, textures, shapes) are highly relevant for distinguishing archaeological artifacts, even across vastly different domains. The smaller gap between training and testing data classification (0.92 vs 0.83) indicates good generalization, likely due to the pre-trained weights providing a strong initialization. Despite significant improvement, Alacahöyük remains problematic with a 0.67 F1-score, suggesting this site’s stelae may share visual characteristics with other sites or require domain-specific feature engineering.

The hybrid model combining transfer and meta-learning (Table 6) demonstrates exceptional performance and represents a significant advancement, achieving higher and more consistent results across all Hittite stelae provenience sites. Test performance (0.821) exceeds training performance (0.725), indicating excellent generalization capabilities rather than overfitting. This counterintuitive improvement from training to testing suggests the meta-learning approach has successfully learned to adapt and generalize to new examples. Arslantepe classifications achieve near-perfect performance with a 0.933 F1-score, perfect test precision (1.00), and excellent recall (0.889). This represents the best single-site performance across all tested models. Alacahöyük classifications show dramatic improvement, achieving a 0.797 F1-score; the balanced precision (0.804) and recall (0.833) indicate the meta-learning approach has successfully addressed the classification challenges for this site. Sakçagözü results, also with a 0.797 F1-score, confirm the model’s robustness across different site characteristics. The same can be said about Karkamış, which achieves a 0.755 F1-score with balanced metrics, representing a consistent improvement over previous approaches. The model’s ability to improve from training to testing exemplifies meta-learning’s core strength: learning to learn quickly from limited examples, which is ideal for archaeological datasets with scarce labeled data. Unlike the simple pre-trained transfer model, which showed variation between sites, the meta-learning model achieves more uniform performance (F1-scores ranging from 0.755 to 0.933), indicating robust feature learning. Compared to plain ResNet18 results, meta-learning shows comparable average performance (0.821 vs. 0.83) but with superior consistency across sites and better generalization characteristics.
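For readers unfamiliar with the mechanics behind the hybrid model, the sketch below illustrates a first-order MAML update (FOMAML, a common approximation of full MAML) on a linear head over frozen ResNet18 features. The learning rates, step counts, and episode format are hypothetical choices for illustration; the actual hybrid combines MAML with FSL episodes and transfer learning as described in the Methods section.

```python
import torch
import torch.nn.functional as F

# Shared initialization for a 4-way linear head over 512-d ResNet18 features.
feat_dim, n_classes = 512, 4
meta_w = torch.zeros(n_classes, feat_dim)
meta_b = torch.zeros(n_classes)

def adapt(w, b, feats, labels, inner_lr=0.01, steps=5):
    """Inner loop: a few gradient steps on one episode's support set."""
    w = w.clone().requires_grad_(True)
    b = b.clone().requires_grad_(True)
    for _ in range(steps):
        loss = F.cross_entropy(feats @ w.t() + b, labels)
        gw, gb = torch.autograd.grad(loss, (w, b))
        w = (w - inner_lr * gw).detach().requires_grad_(True)
        b = (b - inner_lr * gb).detach().requires_grad_(True)
    return w, b

def meta_step(episodes, meta_lr=0.001):
    """Outer loop: update the initialization with first-order meta-gradients."""
    for support_x, support_y, query_x, query_y in episodes:
        w, b = adapt(meta_w, meta_b, support_x, support_y)
        query_loss = F.cross_entropy(query_x @ w.t() + b, query_y)
        gw, gb = torch.autograd.grad(query_loss, (w, b))
        # FOMAML: apply the query-set gradient, taken at the adapted
        # weights, directly to the shared initialization.
        meta_w.sub_(meta_lr * gw)
        meta_b.sub_(meta_lr * gb)
```

Each episode would consist of support and query feature tensors for a few labelled fragments per site, mimicking the few-shot conditions of archaeological practice.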

To sum up, our results prove that in Hittite art and historical studies, it is feasible to accurately predict and trace the production centers of decorated stone fragments, even when these artifacts are heavily eroded and deteriorated. The hybrid model, which integrates MAML with FSL and transfer learning techniques using the ResNet18 architecture, performs well in classifying stele fragments from four significant archaeological sites: Alacahöyük, Arslantepe, Karkamış, and Sakçagözü. Achieving an impressive classification accuracy of approximately 83%, these machine learning models effectively discern subtle stylistic and iconographic elements that reflect the distinct production techniques of each region, even with a tiny dataset.

The model’s analytical capabilities extend to the morphological and decorative motifs of the stele fragments, successfully identifying specific cultural signatures unique to each site. Although stelae from Alacahöyük, with their distinctive designs and unique stylistic attributes, may seem easier to classify and predict, we have also obtained very good results with Sakçagözü, which exhibits considerable variability in design. We have discovered that difficulties in classification do not prevent obtaining good prediction results with testing data. The performance of the different machine learning models we have compared not only advances our understanding of Hittite artistic production but also enriches provenance studies, providing a robust framework for associating fragmented artifacts with their production centers. This innovative approach marks a promising advancement in AI-based archaeological methodology, harnessing artificial intelligence to complement and extend traditional analytical techniques, thus providing opportunities for novel insights into Hittite society.

In this study, we did not investigate in detail the differences in performance of the different tools with the target types; this part of the research is still ongoing. We know that in our case study, as in others, different algorithms are more or less satisfactory depending on the specific characteristics of the input. The misclassification rate differs across classes because the stylistic features defining each provenience differ. Our results are still a black box: we know that stelae from different regions are different, and the machine knows how to distinguish them, but we researchers still do not know why. We need a deeper stylistic analysis that goes far beyond the limits of this paper, which focuses on the ability of the algorithms to correctly classify and predict. We have checked that the different states of preservation of the reliefs do not affect the results, and that neither resizing nor cropping has “penalized” some images more than others. Classification results are a consequence of the intrinsic homogeneity of images related to each particular site and the heterogeneity between them.

3.2 Testing the classificatory model with additional data

To investigate how increasing the dataset size affects classification performance, particularly test performance, a parallel set of experiments was conducted with our enhanced dataset (208 instances). With 14 test examples per class (56 total), these results provide more statistically robust comparisons than the previous smaller dataset, increasing confidence in the observed performance patterns. Basic parameters are the same as in Table 2, to ensure comparability of results. It should be remarked that the human expert’s predictions were far better with this additional data, with an average of 85.7% success.

Table 9 indicates the improvement rates after enhancing the dataset. Results from classical statistical learning are still poor, although they improve with the additional data, especially in the case of K-nearest neighbors. It is interesting to note the failure of Random Forest, Logistic Regression, and Decision Trees to generalize a classificatory rule: after classifying all training data correctly, they fail when applying the learned rule to new data. Results from CNN models are far better, although fairly similar to the results with the smaller database, as if the new evidence had not improved results. Again, meta-learning shows better results than simple transfer learning. Results from non-pre-trained CNNs are poor, even with meta-learning. Results of transfer learning (without and with meta-learning) are far more consistent (Table 10).

Table 9

Performance comparison between initial and enhanced datasets implemented models.

                    |          INITIAL DATASET           |          ENHANCED DATASET          | IMPROVEMENT
MODEL               | TRAIN ACC. | TEST ACC. | CV SCORE  | TRAIN ACC. | TEST ACC. | CV SCORE  | (TEST ACC.)
Statistical
SVM                 | 91.67%     | 39.29%    | 34.33%    | 98.68%     | 55.36%    | 45.32%    | +16.07%
KNN                 | 53.70%     | 25.00%    | 31.47%    | 55.92%     | 55.36%    | 44.08%    | +30.36%
Random Forest       | 100.00%    | 42.86%    | 31.52%    | 100.00%    | 55.36%    | 45.32%    | +12.50%
Logistic Regression | 100.00%    | 50.00%    | 39.70%    | 100.00%    | 57.14%    | 48.01%    | +7.14%
Decision Tree       | 100.00%    | 25.00%    | 33.25%    | 100.00%    | 33.93%    | 35.54%    | +8.93%
Naive Bayes         | 72.22%     | 35.71%    | 35.24%    | 62.50%     | 44.64%    | 42.05%    | +8.93%
ANNs Based
Simple CNN + FSL    | 58.00%     | 44.00%    | N/A       | 90.44%     | 43.45%    | N/A       | –0.55%
Hybrid Model        | 73.21%     | 81.94%    | N/A       | 93.43%     | 82.74%    | N/A       | +0.80%
Transfer Learning   | 83.93%     | 72.22%    | N/A       | 83.58%     | 75.60%    | N/A       | +2.71%
Reference
Human Expert Prediction (Assoc. Prof. in Hittite Art) | — | 62.50% | — | — | 85.70% | — | —
Table 10

Class-Specific performance of testing on enhanced dataset.

           | TRANSFER LEARNING (RESNET18)   | HYBRID (MAML+FSL+RESNET18)
CLASS      | PRECISION | RECALL | F1-SCORE  | PRECISION | RECALL | F1-SCORE
Alacahöyük | 0.61      | 0.78   | 0.68      | 1.00      | 0.79   | 0.88
Arslantepe | 0.83      | 0.71   | 0.76      | 0.81      | 0.93   | 0.87
Karkamış   | 0.81      | 0.64   | 0.72      | 0.71      | 0.71   | 0.71
Sakçagözü  | 0.93      | 1.00   | 0.96      | 0.93      | 1.00   | 0.97
Average    | 0.79      | 0.78   | 0.78      | 0.86      | 0.86   | 0.86

While the expanded dataset shows improvement, the gain is surprisingly modest considering the additional training data, suggesting the meta-learning approach was already performing near its optimal level with the limited initial dataset. However, the classifier for Alacahöyük samples benefited most from additional training examples, going from an initial F1-score of 0.797 to 0.88 with the additional data (+10.4%). In contrast, classifications for samples from Arslantepe show a slight decline, suggesting the additional data may have introduced more challenging examples. The modest overall improvement (+4.7%) despite additional training data demonstrates meta-learning’s exceptional data efficiency: the initial small dataset was sufficient for the algorithm to learn effective adaptation strategies. These results suggest meta-learning approaches may experience diminishing returns from additional data, as they are designed to learn quickly from limited examples rather than requiring large datasets. It is important to take into account:

  • The expanded dataset likely provides more statistically reliable results due to larger sample sizes, even if absolute performance improvements are modest.

  • The additional data should result in narrower confidence intervals around the performance estimates, increasing the reliability of the reported metrics.

  • The fact that some sites showed performance decreases with additional data suggests the meta-learning model had achieved stable generalization with the initial dataset, and additional examples may have introduced noise rather than signal.

  • The results emphasize that for meta-learning approaches, data quality and diversity may be more important than raw quantity, as the algorithm is designed to extract maximum learning from minimal examples.

This comparison provides strong empirical support for meta-learning’s core value proposition: achieving high performance with limited training data. The ability to reach an 82% F1-score with the initial small dataset, with only modest improvements from additional data, validates meta-learning as the optimal approach for archaeological computer vision, where data scarcity is inherent to the domain.

Table 11 presents a detailed comparative analysis for enhanced dataset predictions across all classes, showing that while the human expert achieved slightly better overall accuracy (85.7%) than the hybrid model (82.74%), the patterns of errors differed meaningfully between human and machine classification. Both approaches achieved perfect or near-perfect performance on Sakçagözü samples, but the hybrid model excelled with Arslantepe artifacts (92.9% accuracy), while the human expert performed better with Alacahöyük samples (85.7% accuracy).

Table 11

Comparative analysis for enhanced dataset predictions for each class on test.

CLASS            | HUMAN EXPERT  | TRANSFER LEARNING | HYBRID MODEL
Overall Accuracy | 85.7%         | 75.6%             | 82.74%
Alacahöyük       | 12/14 (85.7%) | 10/14 (71.4%)     | 11/14 (78.6%)
Arslantepe       | 11/14 (78.6%) | 9/14 (64.3%) *    | 13/14 (92.9%)
Karkamış         | 11/14 (78.6%) | 9/14 (64.3%)      | 10/14 (71.4%)
Sakçagözü        | 14/14 (100%)  | 13/14 (92.9%)     | 14/14 (100%)

3.3 Human-based learning and machine deep learning

In a preliminary experiment, with the initial 136 images, the human expert assigned each of the 24 test samples to one of the 4 centers, according to his previous experience. In contrast, the different machine learning models required indexing testing data differently to perform predictions. Each model predicted the provenance of the 24 test samples by referencing 4 training folders (6 samples in each folder) with matching names.

Results are clearly unbalanced in favor of the AI-based models, which performed better than the human expert: 85.7% vs. 62.50% precision. The highly degraded nature of some of the samples may explain the modest results obtained by the human expert.

In the case of the enhanced dataset, with 14 test examples per class (56 total), the influence of very degraded samples diminishes, and the results improve: the human expert obtained an 85.7% accuracy. These results show no clear superiority of human over AI-based learning, or vice versa. The purpose of consulting a human expert was, in any case, to gain insights into the machine’s performance, so the expert was simply asked to classify the same dataset. The expert evaluation protocol maintained experimental rigor by presenting samples in randomized order without indication of their actual provenance, preventing confirmation bias or other systematic errors that might artificially inflate performance metrics. This blind evaluation ensured that the human expert results provide a legitimate benchmark for assessing machine learning effectiveness rather than serving merely as validation of known classifications.

For stelae from Sakçagözü, the human expert, the plain transfer model, and the transfer+meta-learning hybrid achieve perfect or near-perfect performance (100%, 92.9%, and 100%, respectively). This suggests Sakçagözü stelae possess highly distinctive visual characteristics that are easily recognizable by both human experts and AI systems. Alacahöyük stelae have been the most difficult for the AI models to identify. Human expertise provides a 7.1% advantage here, suggesting subtle stylistic features that require archaeological knowledge, and not only image features, to distinguish. The same can be said for Karkamış, although the AI-based results are only moderately worse. In the Arslantepe case, the hybrid transfer+meta-learning model outperforms the human expert by 14.3%, suggesting the AI has learned visual patterns that even experts find challenging to identify consistently.

In the initial dataset phase, utilizing 136 samples (112 training, 24 testing), we established proof-of-concept effectiveness under the most challenging conditions archaeologists typically encounter. The human expert achieved 62.5% accuracy on the 24-sample test set, correctly classifying 15 of 24 fragments. This performance established our baseline reference for evaluating machine learning effectiveness under conditions where even expert human classification faces significant difficulties due to fragment degradation and limited diagnostic features. The enhanced dataset phase, employing 208 samples (152 training, 56 testing), revealed substantially improved human expert performance: 85.7% accuracy on the expanded test set, correctly classifying 48 of 56 fragments. This dramatic improvement of 23.2 percentage points demonstrates the critical importance of comprehensive reference material for both human and machine-based archaeological classification, a principle that applies across both cognitive and computational approaches to pattern recognition in archaeological contexts.

3.4 Additional validation

Gradient-weighted Class Activation Mapping (Grad-CAM) utilizes the gradients of any target concept (such as a specific class score) flowing into the final convolutional layer to produce visualizations. These gradients indicate the importance of each feature map for a particular decision. The technique calculates the derivative of the output of the reduction layer for a given class with respect to a convolutional feature map. Regions where this gradient is large correspond to areas where the final score depends most on the input data. The gradients are then combined and processed to create a heatmap that can be overlaid on the original image with transparency (typically 50% alpha), thereby simultaneously displaying both the original image and the areas of focus (Berganzo-Besga et al. 2022).

We employed this approach to ensure that our best-performing AI-based model focuses on genuine stylistic elements rather than background stone textures or irrelevant image artifacts. We relied on the PyTorch Grad-CAM package (https://github.com/jacobgil/pytorch-grad-cam) for AI explainability to verify that the algorithm learned correctly and was free of biases related to data acquisition. The method was implemented with a Grad-CAM threshold of 0.50 to facilitate visual interpretation (ranging from 0 to 1, where 0 indicates insignificance and 1 represents decisive influence in determining the image’s class). This visual explanation helps to elucidate the patterns used by the deep learning model to classify settlements and, consequently, to better understand Hittite stelae.
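A minimal sketch of this verification step with the pytorch-grad-cam package is shown below. It assumes a fine-tuned 4-class ResNet18 and uses placeholder image arrays; the 0.50 activation threshold and 50% overlay alpha match the settings described above, while the preprocessing is generic.

```python
import numpy as np
import torch
from torchvision import models
from pytorch_grad_cam import GradCAM
from pytorch_grad_cam.utils.image import show_cam_on_image
from pytorch_grad_cam.utils.model_targets import ClassifierOutputTarget

# Assumed setup: a ResNet18 fine-tuned for the four provenance classes.
model = models.resnet18(weights=None)
model.fc = torch.nn.Linear(model.fc.in_features, 4)
model.eval()

target_layers = [model.layer4[-1]]         # last convolutional block
input_tensor = torch.rand(1, 3, 224, 224)  # placeholder preprocessed image
rgb_img = np.random.rand(224, 224, 3).astype(np.float32)  # placeholder, in [0, 1]

cam = GradCAM(model=model, target_layers=target_layers)
grayscale_cam = cam(input_tensor=input_tensor,
                    targets=[ClassifierOutputTarget(0)])[0]  # heatmap for class 0

# Keep only activations above the 0.50 threshold, then overlay the heatmap
# on the image at 50% alpha (image_weight=0.5).
grayscale_cam = np.where(grayscale_cam >= 0.5, grayscale_cam, 0.0)
overlay = show_cam_on_image(rgb_img, grayscale_cam, use_rgb=True, image_weight=0.5)
```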

To ensure that our models classify stelae based on iconographic content rather than irrelevant background features, we defined a new model with five input classes, adding a new category labeled “background.” The training dataset contained samples in which the material characteristics of the stone support or color/brightness aberrations interfered with stylistic features. Once the best model was selected, another model was trained with the additional background class, which absorbed potential misclassifications in which background features outweighed stylistic motifs. The new training dataset contained 155 images, with 31 per class (including the background class).

In a preliminary experiment, using data trained without the additional fifth class (Figure 6), the heatmaps revealed that the model primarily focused on specific engraved features and relief patterns rather than background areas or stone texture alone. The red/orange activation zones consistently highlighted carved figures and reliefs. Where these features were visible, the model showed strong activation on human figures, animals, and decorative motifs, capturing unique stylistic features that likely distinguish different cultural origins. The model appeared to be learning culturally specific artistic styles rather than relying on irrelevant features such as lighting or background. Similar types of engravings (e.g., human figures, geometric patterns) received comparable attention across samples. Nevertheless, some heatmaps displayed scattered activation rather than coherent focus on complete motifs. In certain images, edge effects were observed, where activations appeared concentrated on artifact boundaries, which may indicate that the model was learning shape rather than content. Better-preserved areas received more attention, potentially overlooking diagnostic features in damaged regions.

Figure 6

Grad-CAM, Guided Backpropagation, and Guided Grad-CAM applied on the trained model without the background class.

When re-training data with the fifth class, the addition of a stone background class appears to have significantly improved the model’s ability to distinguish between carved content and substrate material. Several improvements are evident (Figure 7 and Table 12).

Table 12

Comparison of performance metrics of nested CV models with and without the background class.

MODEL         | BEST TRAINING MEAN ACCURACY | TESTING MEAN ACCURACY
4-class model | 0.9275                      | 0.7778
5-class model | 0.8059                      | 0.6518
Figure 7

Grad-CAM, Guided Backpropagation, and Guided Grad-CAM applied on the trained model with the background class.

The heatmaps show:

  • More focused activations on engraved features rather than stone texture variations.

  • Reduced noise from natural stone patterns and surface irregularities that previously may have confused the classifier.

  • Cleaner attention maps with less scattered activation across non-diagnostic areas.

  • Sharper boundaries around carved elements.

  • Better figure-ground separation. The model now more effectively ignores background stone variations while highlighting cultural artifacts.

  • Consistent attention patterns across different preservation states and lighting conditions.

While improved, some challenges persist:

  • Fragmented activations still occur in heavily weathered areas.

  • Edge effects remain visible where carved and uncarved regions meet.

  • Variable attention intensity across different artifact preservation states.

The comparison between four-class and five-class model attention patterns proved particularly revealing. While the four-class model occasionally exhibited attention to stone surface characteristics alongside iconographic elements, the five-class model with background class showed markedly more focused attention on carved depictions. This pattern suggests that the inclusion of negative examples through the background class, despite reducing overall accuracy, produced more archaeologically valid classification strategies that better align with human expert methodology.

The developed pre-trained ResNet18 model was evaluated with a nested cross-validation approach (outer and inner k = 3) due to the small number of samples, instead of using the traditional hold-out method in which the dataset is divided into training, validation, and testing subsets. This approach, although more computationally expensive, guarantees greater reliability and stability of the accuracy estimates, and is therefore optimal for limited datasets (Vabalas et al. 2019).
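The sketch below illustrates the nested scheme with scikit-learn. For brevity, a logistic regression over placeholder feature vectors stands in for the fine-tuned network, and the hyperparameter grid is hypothetical: the inner three folds select hyperparameters, while the outer three folds yield the reported accuracy estimate.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Placeholder data: 155 feature vectors over five classes (incl. "background").
rng = np.random.default_rng(0)
X = rng.random((155, 512))
y = np.repeat([0, 1, 2, 3, 4], 31)

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)

# Inner loop: hyperparameter selection; outer loop: unbiased accuracy estimate.
search = GridSearchCV(LogisticRegression(max_iter=1000),
                      param_grid={"C": [0.1, 1.0, 10.0]},
                      cv=inner_cv)
outer_scores = cross_val_score(search, X, y, cv=outer_cv)
print("Nested CV mean accuracy:", outer_scores.mean())
```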

Grad-CAM analysis suggests that the observed performance reduction may result from the integration of additional features and, consequently, greater original variation. The background class effectively functioned as a negative control, compelling models to develop precise attention mechanisms focused on iconographic content rather than relying on broader visual patterns that might include non-diagnostic features. Grad-CAM visualizations across all four archaeological site classes confirmed that our models successfully learned to attend to carved depictions and iconographic elements rather than irrelevant photographic or stone surface characteristics. Heatmap analysis revealed concentrated attention on areas containing human figures, symbolic representations, and decorative motifs—precisely the features that human experts examine when determining stele provenance. Importantly, visualizations demonstrated minimal activation in background regions, photographic edges, or plain stone surfaces, confirming that models avoided learning spurious correlations based on imaging conditions or non-cultural features.

The incorporation of a fifth output class (“background”) provides an additional validation of the method’s potential for achieving more accurate classifications. Current models may confuse stelae from different sites that share similar stone materials, leading to inflated error rates that could be corrected by explicitly incorporating material variation effects. Sites that previously exhibited lower precision scores (such as Alacahöyük in earlier results) may show substantial improvements as the model learns to distinguish stylistic features from geological characteristics. This directly addresses a core challenge in archaeological classification, where material properties can obscure cultural patterns.

By explicitly accounting for stone background effects, the model’s decisions become more archaeologically meaningful, focusing on genuine cultural and stylistic differences rather than geological coincidences.

4. Conclusion

Our study intended to evaluate the real possibilities offered by machine learning tools in archaeology, given the challenges posed by usual archaeological data: characteristically limited in quantity, perhaps also in diversity, and frequently lacking strict standardization. The explicit approach adopted in our investigation represents a significant departure from idealized experiments, situating it at the pragmatic intersection of computer science and archaeological practice. There, the ultimate value of artificial intelligence lies not in its performance under optimal conditions but in its capacity to function effectively within the inherent limitations and variability that practicing archaeologists encounter in their documentary and analytical workflows. Our study thus provides insight into whether these computational approaches offer substantive advantages when confronted with the characteristic challenges of archaeological data.

Among the usual strategies for overcoming the challenges posed by small datasets are the following:

  • Feature selection: We have carefully post-processed images to reduce dimensionality and improve model performance.

  • Flattening images: For small datasets, flattening images and using traditional machine learning algorithms can be more effective, since these algorithms can generalize well with fewer data points, whereas Convolutional Neural Networks may require large amounts of data to avoid overfitting. However, the use of flattened images is not always advisable, since flattening discards the spatial arrangement of pixels within the image.

  • Remove outliers: We have identified and eliminated possible outliers.

  • Data augmentation: Although in some cases training datasets can be expanded using synthetic samples or by combining data from other potential sources, we have avoided this mechanism. Data augmentation methods, while seemingly a solution, can distort the true representation of the data and lead to overly optimistic performance estimates that don’t translate to real-world applications. Furthermore, our test set deliberately includes the common archaeological challenges of suboptimal archive photography and non-standard and poor-quality visual data. These real-world limitations cannot be easily bypassed with data augmentation, as such techniques risk introducing further bias and disrupting the delicate bias-variance balance crucial for robust model performance.

  • Transfer learning: We have developed a general model on large, readily available datasets (ResNet18) before fine-tuning it on our small dataset.

  • Regularization: We have selectively applied regularization techniques such as weight constraints and early stopping in the validation step to keep models more conservative and prevent overfitting (see the sketch following this list).

  • Ensemble methods: We have combined multiple models to improve overall performance and reduce overfitting.

  • Active learning: We have minimized the amount of labeled data required by selectively choosing the most informative samples for labeling.

  • Cross-validation: We have used appropriate cross-validation techniques to ensure reliable model evaluation despite limited data.

  • Domain expertise: We have leveraged subject matter knowledge throughout the modeling process to guide feature selection and model interpretation.
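As an example of the regularization point above, the sketch below shows a generic early-stopping loop. The patience value is hypothetical, and train_epoch/validate are stand-ins for one training pass and one validation pass of the network.

```python
import copy

def fit_with_early_stopping(model, train_epoch, validate,
                            max_epochs=100, patience=10):
    """Stop once validation accuracy fails to improve for `patience` epochs."""
    best_acc, best_state, stale = 0.0, None, 0
    for _ in range(max_epochs):
        train_epoch(model)
        acc = validate(model)
        if acc > best_acc:
            best_acc, best_state = acc, copy.deepcopy(model.state_dict())
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                break  # halt before the model memorizes the training set
    if best_state is not None:
        model.load_state_dict(best_state)  # restore the best checkpoint
    return best_acc
```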

The positive results achieved by our study, after being trained with only a very small dataset, illustrate the transformative impact of integrating cutting-edge AI technologies with archaeological expertise. Statistically based classical machine learning methods are heavily affected by the size of the training set, with misclassification errors near 50% or even higher. However, ANN-based approaches proved far more resilient. Our findings demonstrate that the combination of MAML, FSL, and transfer learning (our hybrid model) yielded the strongest results. This hybrid approach consistently demonstrated superior generalization and prediction capability, achieving 81.94% testing accuracy on the initial dataset (as shown in Table 3) and 82.74% on the enhanced dataset (Table 9). This performance not only surpassed the standard pre-trained CNN (ResNet18) model but also showed a more robust F1-score of 0.86 on the enhanced dataset (Table 10). In this case, our results show that hybrid learning classifies the training and testing datasets better overall. Both AI methods, hybrid and transfer learning, significantly outperformed conventional machine learning algorithms; they also surpassed human expert classification on the initial dataset (62.5%) and approached it on the enhanced dataset (85.7%). The advantage of meta-learning and transfer learning approaches over classical statistical learning methods lies in their ability to learn effectively from small datasets, such as those we have in archaeological research. Furthermore, the flexibility of these methods allows for the easy integration of new optimization algorithms, potentially increasing model performance without the need to develop entirely new models for each situation.

Image-based datasets can limit the processing capabilities of classical machine learning algorithms. In this context, “overfitting” was observed in some statistical discriminant models, while “underfitting” occurred in others. Additionally, certain examples within the dataset were not adequately processed or considered, despite their presence. Distance-based algorithms, such as Support Vector Machines (SVM) and K-Nearest Neighbors (KNN), encounter significant challenges when working directly with multidimensional data. Without explicit flattening or feature extraction, they struggle to process the data effectively, often resulting in low accuracy. Although some models in this study were provided with images in their cropped form, the performance of conventional distance-based models did not yield the promising results anticipated. This lack of success can be attributed to the inherent limitations of these algorithms when handling high-dimensional image data directly, without appropriate preprocessing steps. In any case, the comparison serves to highlight the comparative success of the other methods in solving the problems we were addressing.

Although limited in size, our testing dataset has been explicitly built to give equal relevance to the four proveniences tested, and it deliberately includes the common archaeological challenges of suboptimal archive photography and non-standard, poor-quality visual data. Despite these inherent data limitations, our cross-validation results demonstrate remarkable consistency across folds, achieving an average accuracy of 86.9% and suggesting robust generalization even from limited and poor-quality examples. As noted above, the observed misclassification patterns align with established archaeological knowledge regarding artistic similarities between certain cities, further evidence that the model captures meaningful relationships within the data.

Archaeologists rarely encounter pristine, abundant comparative material in the field. Each new fragment presents a unique classification challenge, often requiring expert judgment based on limited precedent. Our approach directly simulates these real-world conditions, providing a more realistic and practical assessment of how these models would perform when deployed in actual archaeological research. While we could potentially expand the test set to 15–20 samples per city from our original 412 cropped samples, this study serves as a proof of concept, demonstrating the potential of meta- and transfer learning in addressing the specific challenges of archaeological data, especially in the context of FSL.

Our findings validate the particular value of meta-learning approaches designed for few-shot scenarios, which proved remarkably effective at extracting meaningful patterns from limited examples. The hybrid model’s ability to achieve over 90% validation accuracy with only 112 training samples demonstrates that these approaches can function effectively within the data constraints that typically prevent implementation of computational methods in archaeology.

More intriguingly, our analysis of human expert performance patterns provides crucial insights into the complementary nature of human and machine intelligence in archaeological classification. The human expert’s accuracy improved substantially from 62.5% to 85.7% between initial and enhanced datasets, demonstrating the fundamental importance of comprehensive reference collections for both cognitive and computational pattern recognition. However, the detailed class-specific analysis presented in Tables 10 and 11 reveals that human and machine approaches exhibit distinctly different error patterns, with machines excelling in certain regional styles while humans demonstrate superior performance in others.

Perhaps the most critical contribution of our research lies in demonstrating that high-performing machine learning models base their decisions on archaeologically meaningful features rather than spurious correlations or photographic artifacts. The Grad-CAM visualizations presented in Figures 6 and 7 provide compelling evidence that our models learned to focus attention on carved iconographic elements, symbolic representations, and decorative motifs—precisely the features that human experts examine when determining stele provenance (Selvaraju et al. 2020; Berganzo-Besga et al. 2025).

The trade-off analysis between four-class and five-class models reveals a fundamental insight into optimizing AI systems for archaeological applications. While the inclusion of a background class reduced raw classification performance from 92.75% to 80.59% validation accuracy, it produced classification behavior more closely aligned with human expert methodology. This finding suggests that optimal AI performance in archaeological contexts may not correspond to maximum accuracy on traditional metrics, but rather to systems that make decisions based on culturally appropriate features.

This validation of archaeological authenticity addresses one of the most significant barriers to scholarly acceptance of AI methods in cultural heritage research. By demonstrating transparency in decision-making processes and alignment with disciplinary knowledge, our approach provides a template for developing AI systems that can enhance rather than threaten the methodological rigor that characterizes high-quality archaeological research.

Our case study, although limited and very restricted to a particular kind of material, not only validates the application of machine learning methods in archaeology but also sets a precedent for future studies aiming to tackle similar classification challenges in other domains of cultural heritage. The success of these models in classifying highly altered, poorly preserved, and fragmented artifacts demonstrates their potential in real-world archaeological scenarios, where perfect specimens are rare.

The results suggest human experts and AI systems have complementary capabilities rather than one being universally superior. The hybrid model’s superiority for Arslantepe, while humans excel for Karkamış, indicates different pattern recognition strengths. This variation across sites suggests that different archaeological contexts may benefit from different approaches, some favoring human expertise, others benefiting from AI pattern recognition. This comparison represents a landmark achievement in archaeological computer vision, demonstrating that sophisticated AI approaches can approach and occasionally exceed human expert performance in specialized classification tasks. The results validate the potential for AI-assisted archaeological research while highlighting the continued importance of human expertise for complex stylistic analysis. The near parity between human and AI performance on the enhanced dataset suggests we are approaching a threshold where AI becomes a practical tool for archaeological classification, even with markedly scarce and inadequate datasets, opening new possibilities for large-scale analysis of archaeological collections.

Additional Files

The additional files for this article can be found as follows:

Supplementary Material 1
Supplementary Material 2

Acknowledgements

  • This study has been realized in the context of the PhD Program in Prehistoric Archaeology at the Universitat Autònoma de Barcelona.

  • We also acknowledge the help of the Ankara Anatolian Civilizations Museum for data access, especially to Mr. Umut Alagoz, assistant director, and to the Hittite expert Mine Çiftçi.

  • Thanks to Assoc Prof. Alessandra Gilibert from Università Ca’ Foscari Venezia, Venice (UNIVE)/Department of Humanities for her collaboration.

  • Thanks to Carles Boned, Adrià Molina (Universitat Autònoma de Barcelona, Computer Science Department), Rosario Delgado (Universitat Autònoma de Barcelona, Mathematics Dept.), and Mehmet Anıl Akbay (Artificial Intelligence Research Institute/IIIA-CSIC in UAB Campus) for comments about a preliminary version of this paper.

  • The study’s preliminary version, including the first results, was presented at the 5th CAA-GRE Conference (Computer Applications and Quantitative Methods in Archaeology, Greece Section), 16–17 April 2024, Serres, Greece (oral presentation).

  • Funding has been made possible thanks to projects PID2019-109254GB-C21 and PID2023-150978NB-C21 from the Spanish Ministry of Science, Innovation and Universities, and 2021 SGR 00190 from the Agència de Gestió d’Ajuts Universitaris i de Recerca (Generalitat de Catalunya).

  • Kernels can be provided upon request. Figures were made on (c) Canva and flowcharts in (c) Miro.

  • We thank the editor and two anonymous reviewers from JCAA for an exhaustive review of the paper that contributed to adding clarity and understandability.

  • Remaining errors and misunderstandings are the responsibility of the authors alone.

Declaration of Generative AI and AI-Assisted Technologies in the Writing Process

Statement: During the preparation of this work, one of the authors (Deniz Kayikci) used CLAUDE AI/ANTHROPIC to check and fix the main code blocks in the computer programs used. After using this tool/service, the authors reviewed and edited the content as needed and have taken full responsibility for the content of the publication.

Competing Interests

The authors have no competing interests to declare. None of the listed authors is currently a member of the journal’s editorial team or board, nor has any of them held such a position within the last three years.

DOI: https://doi.org/10.5334/jcaa.196 | Journal eISSN: 2514-8362
Language: English
Submitted on: Jan 8, 2025 | Accepted on: Dec 3, 2025 | Published on: Jan 30, 2026
Published by: Ubiquity Press

© 2026 Deniz Kayıkcı, Iban Berganzo-Besga, Juan Antonio Barceló, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.
