1 Context and motivation
Vision-Language Models (VLMs) are increasingly being adopted in teaching and research infrastructures to enhance the textual discovery of visually related or similar objects. This trend is particularly evident in the field of art history. Unlike traditional metadata-based approaches, where an object is represented only through textual descriptors curated by humans,1 these methods address an object directly through its visual structure, enabling content-aware searches to operate at the image level. For instance, since 2021, the web application imgs.ai has offered a multimodal search engine for art-historical image collections, employing state-of-the-art VLMs such as CLIP (Contrastive Language-Image Pre-training; Radford et al., 2021) to link textual and visual representations (Offert & Bell, 2023);2 similarly, Google Arts & Culture has experimented with VLM-assisted retrieval.3 VLM-supported techniques have also begun to enter museum practice (e.g., Mazzanti et al., 2025). In such contexts, however, the focus often shifts from scholarly analysis to exploratory or playful modes of engagement, framed by a “gamification” of search.

These developments highlight a methodological shift: VLM-based methods promise to supplement—or even replace—identity-based searches that depend on manually curated textual metadata. It is their ability to provide more fluid, intuitive access to art-historical objects, without the categorical restrictions imposed by human annotators, that makes them appealing.

However, as Impett & Offert (2022) argue, VLMs such as CLIP do not merely retrieve images—they encode and project a “machine visual culture,” embedding culturally situated biases and imaginaries that determine what is recognized as similarity. These models are implicitly ekphrastic: they learn about images through text and about text through images, structuring one medium in relation to the other; their embeddings capture not only object-level similarities but also para-visual concepts of arbitrary complexity. At the same time, the large-scale multimodal datasets on which models like CLIP are trained incorporate problematic material—from Eurocentric stereotypes to non-consensually harvested images—that becomes irreversibly absorbed into the models’ internal representations. In this sense, VLMs operate within what Pasquinelli & Joler (2021) describe as a regime of “knowledge extractivism.” Machine learning, they argue, is not so much “intelligent” as it is an apparatus of “automation of perception”—compressing the world into statistical models that magnify certain patterns while systematically obscuring others.
For art history, the emergence of VLMs represents both an opportunity and a methodological challenge. On the one hand, VLMs offer genuinely new opportunities for searching for visual motifs across collections that resist traditional classification; on the other hand, their results are difficult to verify and explain because their underlying mechanisms are obscured by high-dimensional latent spaces. Search engines built on manually curated metadata, by contrast, consolidate decades of interpretive, contextual, and historical expertise—through cataloging systems, controlled vocabularies, and linked data infrastructures such as Wikidata.4 Unlike statistical embeddings, curated metadata is auditable: it can be interrogated, corrected, and expanded, and it carries with it an explicit genealogy of curatorial and scholarly decisions.

This tension between machine-constructed embeddings and human-constructed metadata raises a fundamental question: How can VLMs expand the repertoire of digital art history without displacing its interpretive foundations? In this paper, we argue that VLM-driven search can indeed expand the repertoire of digital art history, but it cannot supplant infrastructures built on manually curated metadata. Its potential, rather, emerges in conjunction with them. We propose a hybrid model in which these two knowledge systems—the curated, human-constructed, and the machine-constructed, extractive—are brought into productive alignment. Specifically, we integrate VLM-based retrieval with metadata-driven faceting, using Wikidata as a data provider for art-historical research. Our contributions are threefold:
We outline a framework for art-historical retrieval that links VLM-derived embeddings to structured knowledge graphs (such as Wikidata). Faceting serves as a central organizing principle in this hybrid environment: it organizes and navigates large information spaces by using structured metadata dimensions, or facets, to classify different aspects of collection items (see, e.g., Yee et al., 2003). Unlike keyword-based retrieval, faceting enables users to filter or recombine results iteratively along multiple attributes—such as creator, period, or material—making visible the multidimensional structure of a collection’s metadata.
Using Large Language Models (LLMs), we present and evaluate a pipeline that derives structured, triplet-based assertions of visual concepts (i.e., entities depicted in artworks) and aligns them with existing metadata. Although Wikidata offers detailed structured metadata, its coverage of iconographic concepts and their relations remains fragmentary; moreover, iconographic content is often encoded only through the classification system Iconclass (van de Waal, 1973–1985; van Straten, 1994). To illustrate the need for a more explicit iconographic representation, consider Peter Paul Rubens’s Perseus Freeing Andromeda (c. 1621). On Wikidata, the property “depicts Iconclass notation” (P1257) is associated with the notation 94P212 (Perseus frees Andromeda of her chains).5 This notation condenses several iconographic components—Perseus, Andromeda, chains, the act of rescue—yet none are represented as discrete knowledge-graph assertions, and they thus remain inaccessible to computational retrieval and reasoning.
We realize this approach in iART, a retrieval environment combining multimodal search with semantic filtering (Schneider et al., 2022; Springstein et al., 2021).6 The iART project, funded by the German Research Foundation (DFG) from 2019 to 2021, is a collaboration between the Chair of Medieval and Modern Art History at Ludwig Maximilian University of Munich and two research groups: the “Visual Analytics” group at TIB – Leibniz Information Centre for Science and Technology, and the “Intelligent Systems and Machine Learning” group at the Heinz Nixdorf Institute of the University of Paderborn. For the present article, a new front end was developed to enhance standard art-historical retrieval processes by making the inherent biases in the employed datasets more transparent.
2 Dataset description
The resulting dataset comprises a total of 36 043 artworks. In addition to the 16 Wikidata properties most relevant for art-historical inquiry, the corpus is further enriched to broaden its semantic and iconographic scope. Specifically, it incorporates 223 271 concepts, 104 843 tuples, and 104 886 triplets derived from Iconclass notations using Qwen3 (Yang et al., 2025), the model identified as best-performing in Section 4, with the prompt template detailed in Appendix B.1. Comprehensive statistics, including the Wikidata properties retrieved and their respective frequencies, are provided in Table 1. The methodological framework underlying the dataset’s construction is detailed in Section 3.
Table 1
Results of the data acquisition process, showing the properties retrieved from Wikidata and their absolute frequencies.
| IDENTIFIER | NAME | DESCRIPTION | FREQUENCY |
|---|---|---|---|
| P18 | image | image of relevant illustration of the subject […] | 36 179 |
| P1476 | title | published name of a work, such as a newspaper article […] | 29 077 |
| P170 | creator | maker of this creative work or other object […] | 33 616 |
| P571 | inception | time when an entity begins to exist […] | 35 588 |
| P31 | instance of | type to which this subject corresponds/belongs […] | 37 393 |
| P2048 | height | vertical length of an entity | 35 276 |
| P2049 | width | width of an object | 35 268 |
| P186 | made from material | material the subject or the object is made of or derived from […] | 66 651 |
| P135 | movement | literary, artistic, scientific or philosophical movement or scene […] | 2668 |
| P136 | genre | creative work’s genre or an artist’s field of work […] | 33 741 |
| P180 | depicts | entity visually depicted in an image, literarily described in a work […] | 62 848 |
| P921 | main subject | primary topic of a work or act of communication | 13 253 |
| P195 | collection | art, museum, archival, […] of which the subject is part […] | 56 189 |
| P276 | location | location of the object, structure or event […] | 41 557 |
| P17 | country | sovereign state that this item is in […] | 5712 |
| P495 | country of origin | country of origin of this item | 1210 |
Repository location
Repository name
Zenodo
Object name
data.jsonl, images.zip, and prompt.txt
Format names and versions
The images are stored in a ZIP file structured into directories named by the first two characters of each image’s hash_id. Within these directories, subfolders named by the next two characters of the hash_id contain the image files, which are named using their full hash_id with a .jpg extension. The annotation data is provided in a JSONL file, where each line encodes metadata for a single image. The prompt template used to generate the predictions for the respective version is provided in prompt.txt.
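For illustration, the following minimal Python sketch reads data.jsonl and resolves each record’s image path from its hash_id; it assumes the ZIP has been extracted to a local images/ directory and that each JSONL record exposes its hash under a hash_id field (the field name is an assumption, not prescribed above).

```python
import json
from pathlib import Path

DATA_FILE = Path("data.jsonl")   # one JSON-encoded metadata record per line
IMAGE_ROOT = Path("images")      # images.zip extracted here

def image_path(hash_id: str) -> Path:
    # Layout described above: <first two chars>/<next two chars>/<hash_id>.jpg
    return IMAGE_ROOT / hash_id[:2] / hash_id[2:4] / f"{hash_id}.jpg"

records = []
with DATA_FILE.open(encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        record["image_file"] = image_path(record["hash_id"])  # assumed field name
        records.append(record)

print(f"Loaded {len(records)} records")
```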
Creation dates
2025-09-29 to 2025-10-25
Dataset creators
Stefanie Schneider (Institute of Art History, Marburg University, Marburg, Germany); Matthias Springstein (TIB – Leibniz Information Centre for Science and Technology, Hannover, Germany); Julian Stalter (Marburg Center for Digital Culture and Infrastructure, Marburg, Germany); Daniel Ritter (L3S Research Center, Leibniz University Hannover, Hannover, Germany); Ralph Ewerth (TIB – Leibniz Information Centre for Science and Technology, Hannover, Germany; L3S Research Center, Leibniz University Hannover, Hannover, Germany; Marburg University and Hessian Center for Artificial Intelligence (hessian.AI), Marburg, Germany); Eric Müller-Budack (TIB – Leibniz Information Centre for Science and Technology, Hannover, Germany).
Language
English
License
CC BY-SA 4.0
Publication date
2025-10-26
3 Method
In this section, we introduce a three-step pipeline that involves: (1) collecting images and structured artwork metadata from Wikidata (Section 3.1); (2) extracting triplet-based assertions of iconographic concepts from Iconclass notations using LLMs (Section 3.2); and (3) combining these complementary textual and visual representations within a multimodal retrieval framework (Section 3.3).
3.1 Data acquisition from Wikidata
We begin by harvesting data from Wikidata, focusing on art-historical objects enriched with both visual representations and semantic annotations. Using the Wikidata SPARQL endpoint,7 we extract objects that are instances of either “visual artwork” (Wikidata item Q4502142) or “artwork series” (Q15709879). To ensure a meaningful dataset, we require each object to provide (1) at least one two-dimensional digital image (P18) and (2) at least one Iconclass notation (P1257).8 For each retrieved object, we collect a broad set of descriptive and contextual properties that later serve as retrieval facets. These include: “image” (P18), “title” (P1476), “creator” (P170), “inception” (P571),9 “instance of” (P31), “width” (P2049), “height” (P2048), “made from material” (P186), “movement” (P135), “genre” (P136), “depicts” (P180), “main subject” (P921), “depicts Iconclass notation” (P1257), “collection” (P195), “location” (P276), “country” (P17), and “country of origin” (P495).
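The query below is a simplified sketch of this harvesting step; the exact SPARQL used for the corpus is not reproduced here, and the LIMIT serves only to keep the example small. It retrieves candidate objects with at least one image and one Iconclass notation via the public endpoint.

```python
import requests

ENDPOINT = "https://query.wikidata.org/sparql"
QUERY = """
SELECT ?item ?image ?iconclass WHERE {
  VALUES ?type { wd:Q4502142 wd:Q15709879 }   # visual artwork, artwork series
  ?item wdt:P31 ?type ;                       # instance of
        wdt:P18 ?image ;                      # at least one image
        wdt:P1257 ?iconclass .                # at least one Iconclass notation
}
LIMIT 100
"""

response = requests.get(
    ENDPOINT,
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "iart-harvest-sketch/0.1 (research prototype)"},
    timeout=60,
)
response.raise_for_status()
for binding in response.json()["results"]["bindings"]:
    print(binding["item"]["value"], binding["iconclass"]["value"])
```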
3.2 LLM-based concept and relation extraction
In the second step, the harvested metadata are semantically enriched through the derivation of structured iconographic representations based on Iconclass notations.
Task definition
Each Iconclass definition is uniquely identified by an alphanumeric code, referred to as the ‘notation,’ hereafter denoted as N, associated with a human-readable description DN. Given a description DN, the objective is to extract (1) the set of relevant art-historical concepts ℂN and (2) their semantic relations, represented as triplets 𝕋N, where each Ti = ⟨subject ∈ ℂN, predicate, object ∈ ℂN⟩ with object ≠ subject. For instance, in Peter Paul Rubens’s Perseus Freeing Andromeda, the Wikidata property “depicts Iconclass notation” (P1257) is associated with the notation 94P212, corresponding to the description DN = Perseus frees Andromeda of her chains.10 From this description DN, we can derive the concept set ℂN = {Perseus, Andromeda, chains} and the relations 𝕋N = {⟨Perseus, frees, Andromeda⟩, ⟨Andromeda, has, chains⟩}. These structured assertions distill the iconographic relationships implied by the Iconclass description into a machine-readable form suitable for computational retrieval.
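For the Rubens example, one possible machine-readable encoding of the target output is shown below; the key names are illustrative, as the schema actually requested from the LLM is defined by the prompt templates in Appendix B.

```python
# Illustrative target structure for N = 94P212 ("Perseus frees Andromeda of her chains").
expected_output = {
    "concepts": ["Perseus", "Andromeda", "chains"],
    "relations": [
        {"subject": "Perseus", "predicate": "frees", "object": "Andromeda"},
        {"subject": "Andromeda", "predicate": "has", "object": "chains"},
    ],
}
```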
Methodology
To solve the aforementioned task, we employ pre-trained LLMs such as LLaMA (Touvron et al., 2023) and Qwen3 (Yang et al., 2025), which have shown state-of-the-art performance for many downstream applications, including structured information extraction (Dunn et al., 2022). More specifically, given an Iconclass description DN, the LLM is instructed to output the concept set ℂN and the relations 𝕋N as triplets using a suitable prompt strategy P. For this purpose, we propose three prompting techniques designed to guide the LLM towards generating outputs in a structured JSON format:11
(CoT) Chain-of-Thought Prompting: This approach enables the LLM to reason and solve the task through clearly defined intermediate steps. In practice, such explicit reasoning often achieves better performance in comparison to short, minimal instructions (Wei et al., 2022).
(Role-CoT) Role-playing Chain-of-Thought Prompting: Here, the LLM assumes defined expert personas. In our case, we employ a team of three experts to solve the task: an art historian who analyses the iconographic meaning, a visual analyst who considers spatial and compositional relationships between depicted concepts, and a semantic modeler who formalizes these relations into valid triplets (Appendix B.2). This can enhance the semantic coherence of extracted entities and relations (Wang & Luo, 2023).
(ReAct) Reason-and-Act Prompting: Combining the thought-driven approach of Chain-of-Thought (CoT) with explicitly delineated action steps, this prompting technique couples reasoning and execution in an iterative loop, thereby potentially yielding more accurate results than CoT prompting (Yao et al., 2023).
Based on the response in JSON format, we extract the predicted concepts ℂN and relation triplets 𝕋N. However, in some cases, LLMs might produce inconsistent responses, i.e., they could extract a concept in the relations—as subject or object—that is not present in the concept list. To resolve this issue, we use the union of all concepts in the concept list and the subjects and objects in the triplet relations, as sketched below.
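A minimal sketch of this repair step, assuming the parsed response follows the illustrative schema shown above (with concepts and relations keys):

```python
def repair_concepts(parsed: dict) -> list[str]:
    """Union of the predicted concept list with all subjects and objects of the triplets."""
    concepts = set(parsed.get("concepts", []))
    for triplet in parsed.get("relations", []):
        concepts.add(triplet["subject"])
        concepts.add(triplet["object"])
    return sorted(concepts)
```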
Post-processing steps
We employ a series of post-processing steps on the model responses to ensure that the resulting JSON output is valid for automatic parsing. For example, any extraneous text preceding or following the JSON output is removed, including the outputs of the thought process produced by models such as Qwen3 (Yang et al., 2025). However, in some cases, the model could still generate malformed or incomplete JSON (e.g., triplets with missing fields). To maintain data quality, we automatically discard all responses that fail to meet the JSON validity requirements.
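The cleanup can be sketched as follows; the <think> delimiters assume the reasoning-trace format emitted by Qwen3 in thinking mode, and the decision to drop an entire response (rather than only the offending triplet) is a simplification of the validity checks described above.

```python
import json
import re

REQUIRED_KEYS = {"subject", "predicate", "object"}

def parse_response(raw: str) -> dict | None:
    """Return the parsed JSON payload, or None if the response has to be discarded."""
    # Remove reasoning traces and any text surrounding the first top-level JSON object.
    cleaned = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL)
    match = re.search(r"\{.*\}", cleaned, flags=re.DOTALL)
    if match is None:
        return None
    try:
        parsed = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    # Reject incomplete triplets, e.g., those with missing or empty fields.
    for triplet in parsed.get("relations", []):
        if not REQUIRED_KEYS <= set(triplet) or not all(triplet[k] for k in REQUIRED_KEYS):
            return None
    return parsed
```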
3.3 Art-historical image retrieval
In the final stage, we combine textual and visual information within a multimodal retrieval environment. Each object image is processed using SigLIP 2, which encodes cross-modal embedding vectors that capture shared visual-semantic features from both image and text modalities (Tschannen et al., 2025). The resulting embeddings are indexed in a Qdrant vector database, enabling efficient approximate nearest-neighbor search.12
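A compressed sketch of this stage is given below; the SigLIP 2 checkpoint, collection name, and payload fields are illustrative choices rather than the exact production configuration, and the Qdrant client API may differ slightly between versions.

```python
import torch
from PIL import Image
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
from transformers import AutoModel, AutoProcessor

CKPT = "google/siglip2-base-patch16-224"  # illustrative SigLIP 2 checkpoint
processor = AutoProcessor.from_pretrained(CKPT)
model = AutoModel.from_pretrained(CKPT).eval()

# Encode an object image into the shared image-text embedding space.
image = Image.open("example.jpg").convert("RGB")
with torch.no_grad():
    inputs = processor(images=image, return_tensors="pt")
    embedding = model.get_image_features(**inputs)[0]

# Index the embedding together with a metadata payload.
client = QdrantClient(":memory:")  # point to a Qdrant server URL in production
client.create_collection(
    collection_name="artworks",
    vectors_config=VectorParams(size=embedding.shape[0], distance=Distance.COSINE),
)
client.upsert(
    collection_name="artworks",
    points=[PointStruct(id=1, vector=embedding.tolist(), payload={"wikidata_id": "Q2529962"})],
)

# Text queries are embedded with the same model and matched via approximate nearest neighbors.
with torch.no_grad():
    text_inputs = processor(text=["Perseus frees Andromeda"], padding="max_length", return_tensors="pt")
    query = model.get_text_features(**text_inputs)[0]
hits = client.search(collection_name="artworks", query_vector=query.tolist(), limit=5)
```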
These multimodal representations—comprising both visual embeddings and associated textual metadata—are then integrated into iART, an open-source retrieval framework that combines semantic filtering with embedding-based similarity search (Schneider et al., 2022; Springstein et al., 2021).13 On the client side, iART presents itself as a web-based GUI implemented in Vue.js.14 The interface supports faceted browsing (e.g., by creator, genre, or movement), while simultaneously leveraging embedding-based similarity to navigate visually and semantically related objects. The overall layout follows a split-view structure, reminiscent of established digital heritage collections and scholarly repositories (Figure 1a).15 On the left, users interact with expandable facet panels that serve as dynamic filters, enabling a structured yet exploratory approach to search. To support critical inspection, the system integrates contextual visualizations into descriptive overviews—for instance, histograms exposing distributions of numerical metadata, thus enabling the rapid detection of distortions or irregularities. In the top-left corner, the active query is condensed into a readable summary, allowing users to track their evolving search trajectory. For more granular control, a modal dialog provides advanced configuration options: users may choose between textual or visual search inputs and specify the models or datasets to be applied, enabling transparent methodological choices within the research workflow (Figure 1b). On the right, the GUI presents the search results in a responsive image grid, optimized for visual browsing and comparison. Each item appears as a thumbnail, maximizing information density while maintaining visual clarity.

Figure 1
Front-end design of iART. (a) On the left side of the GUI, users can access facets organized in expandable panels, while the right side displays the search results in an image grid. (b) For nuanced control of the search results, a modal dialog provides advanced configuration options.
4 Results and discussion
In this section, we assess the effectiveness of our pipeline for deriving structured, triplet-based assertions from Iconclass descriptions in Wikidata using LLMs. We describe our experimental setup, including the metrics employed to quantify model performance, and present the results obtained through both quantitative and qualitative analyses.
4.1 Experimental setup
Ground-truth dataset
The ground truth was obtained through an annotation study conducted by a trained art historian. For each of the nine primary iconographic classes (as defined by the first digit of the Iconclass notation N), the annotator was given up to 10 Iconclass descriptions DN and was instructed to: (1) identify all relevant art-historical concepts ℂNGT and (2) specify the corresponding relations in the form of triplet statements 𝕋NGT, with each triplet ⟨subjectGT ∈ ℂNGT, predicateGT, objectGT ∈ ℂNGT⟩. In total, the dataset comprises 224 concepts and 160 triplets derived from 85 Iconclass descriptions. This includes 26 examples without any valid relations (e.g., 11L313, the Nicene creed), which serve as negative samples for relation extraction.
Evaluation setup and metrics
To evaluate the proposed approach, we use the ground-truth dataset to assess model performance in terms of precision, recall, and F1-score. Since the number of annotations (i.e., concepts, tuples, and triplets) can vary between Iconclass descriptions, we report both micro scores (i.e., average performance over all individual annotations) and macro scores (i.e., the average of the performance computed separately for each Iconclass description) of these metrics to provide more detailed insights. The advantage of the macro score is that it also accounts for Iconclass descriptions that contain no annotations, i.e., negative samples in the ground truth. This is essential for realistic evaluation, as many Iconclass notations do not include concept relations (e.g., 11L313, the Nicene creed). For such cases, we assign recall, precision, and F1-scores of 1.0 when the model correctly predicts an empty set, and 0.0 otherwise.
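The two aggregation schemes can be sketched as follows, taking the per-description match counts (as determined by the judging procedure described below) as given; the handling of empty prediction and ground-truth sets follows the convention stated above.

```python
def prf(n_match: int, n_pred: int, n_gold: int) -> tuple[float, float, float]:
    """Precision, recall, F1 for one Iconclass description; empty/empty counts as perfect."""
    if n_pred == 0 and n_gold == 0:          # negative sample predicted correctly
        return 1.0, 1.0, 1.0
    p = n_match / n_pred if n_pred else 0.0
    r = n_match / n_gold if n_gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def micro_macro(per_description: list[tuple[int, int, int]]):
    """per_description holds (matches, predictions, gold annotations) per Iconclass description."""
    # Macro: average the per-description scores (this rewards correct empty predictions).
    macro = tuple(sum(vals) / len(per_description)
                  for vals in zip(*(prf(m, p, g) for m, p, g in per_description)))
    # Micro: pool the counts over all annotations before computing the scores.
    m, p, g = (sum(col) for col in zip(*per_description))
    micro = prf(m, p, g)
    return micro, macro
```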
Given an Iconclass description DN associated with an Iconclass notation N, the metrics are computed based on the number of matches—i.e., correctly aligned predictions between the model and the ground-truth annotation—for the following subtasks:
Concept extraction, using the extracted concepts ℂN and ground-truth concepts ℂNGT;
Tuple extraction, using the extracted tuples ℙN = ⟨concept1 ∈ ℂN, concept2 ∈ ℂN⟩ and ground-truth tuples ℙNGT = ⟨concept1GT ∈ ℂNGT, concept2GT ∈ ℂNGT⟩. The tuples are derived by excluding the relation, i.e., the predicate, from the corresponding predicted and ground-truth triplets;16
Relation extraction, using the predicted triplets 𝕋N = ⟨subject ∈ ℂN, predicate, object ∈ ℂN⟩ and ground-truth triplets 𝕋NGT = ⟨subjectGT ∈ ℂNGT, predicateGT, objectGT ∈ ℂNGT⟩, including their relation as predicate.
As the number of matches is evaluated based on the correctly aligned predictions between the model and the ground-truth annotation, terminological variation is to be expected: conceptually identical entities may differ at the lexical level (e.g., Saul’s Court vs. sauls court) or in relational phrasing (represented in vs. in). To mitigate these discrepancies, we propose an LLM-as-a-judge approach. All strings in both predictions and the ground truth are first lowercased; an LLM then determines whether two strings are semantically equivalent. As detailed in Section 4.2, we evaluated several LLMs and prompting strategies by comparing their judgments with those of a human expert (the same annotator who produced the ground truth), and measured inter-coder agreement (Appendix C). Based on these results, we chose Gemma 3 (Kamath et al., 2025) using a chain-of-thought prompt (Appendix B.4) for all subsequent experiments, as it achieved a Cohen’s κ of 0.455, which suggests moderate agreement with the human judge (Landis & Koch, 1977).
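As a minimal illustration of the agreement measurement (the binary equivalence decisions shown here are placeholders; the actual judgments are documented in Appendix C):

```python
from sklearn.metrics import cohen_kappa_score

# 1 = "semantically equivalent", 0 = "not equivalent", for the same candidate pairs.
human_judge = [1, 0, 1, 1, 0, 1, 0, 0]   # placeholder decisions
llm_judge   = [1, 0, 0, 1, 0, 1, 1, 0]   # placeholder decisions

print(f"Cohen's kappa: {cohen_kappa_score(human_judge, llm_judge):.3f}")
```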
Models and implementation details
The evaluation is conducted across three state-of-the-art LLMs using an Ollama instance (0.11.10)17 and the prompting strategies CoT, Role-CoT, and ReAct, as introduced in Section 3.2. The models under consideration are: (1) Qwen3 with 8 billion parameters from Alibaba Cloud (Yang et al., 2025); (2) Gemma 3 with 4 billion parameters from Google (Kamath et al., 2025); and (3) LLaMA 3.1 with 8 billion parameters from Meta (Dubey et al., 2024). Each model is instructed to produce outputs for both concept and relation extraction; the final predicted concept set is defined as the union of concepts extracted across these two tasks. We aimed to ensure deterministic outputs by setting the temperature to 1e–10≈0. Nonetheless, occasional non-deterministic outputs were observed for identical prompts. To account for this variability, we report the average results over three iterations; corresponding standard deviations are provided in Appendix D.
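The decoding setup can be reproduced with a call such as the following, assuming a locally running Ollama instance; the model tags and the prompt placeholder are illustrative.

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # local Ollama instance
MODELS = ["qwen3:8b", "gemma3:4b", "llama3.1:8b"]   # tags are illustrative

def run_prompt(model: str, prompt: str) -> str:
    response = requests.post(
        OLLAMA_URL,
        json={
            "model": model,
            "prompt": prompt,
            "stream": False,
            "options": {"temperature": 1e-10},  # near-deterministic decoding
        },
        timeout=300,
    )
    response.raise_for_status()
    return response.json()["response"]

# Average over three iterations to absorb residual non-determinism.
runs = [run_prompt("qwen3:8b", "…CoT prompt for an Iconclass description…") for _ in range(3)]
```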
4.2 Quantitative results
Ground-truth dataset
The quantitative results for all tasks and model configurations are summarized in Table 2. Among the evaluated models, Qwen3 with CoT prompting achieves the highest F1-scores, consistently outperforming Gemma 3 and LLaMA 3.1 across all tasks and metrics under comparable prompting strategies. At the same time, the results highlight substantial variation between configurations. As shown in Table 3, LLaMA 3.1 tends to produce fewer valid extractions than Qwen3 and Gemma 3—often by a substantial margin—largely irrespective of the prompting strategy employed; its lower performance is therefore expected. This issue typically arises under one of two conditions: (1) the model output cannot be fully parsed, preventing triplet extraction, as with LLaMA 3.1 and CoT; or (2) the extracted triplets are incomplete—lacking a subject, predicate, or object—and are thus excluded from evaluation. Qwen3 and Gemma 3 produce comparable micro results, particularly for concept extraction, and for both LLMs the CoT prompt achieves the best results. However, while Qwen3 achieves even better macro results, the corresponding results of Gemma 3 decrease significantly. This suggests that Qwen3 is better at predicting the absence of any relations in an Iconclass description, as such cases are specifically accounted for by the macro metrics (see Section 4.1). This is also supported by Table 3, where Gemma 3 only rarely returns empty relation sets. Overall, we conclude that Qwen3 produces the most robust results and is capable of handling negative examples, which is particularly important for real-world scenarios, as many Iconclass descriptions do not contain concept relations.
Table 2
Mean results for concept extraction, tuple extraction, and relationship extraction in terms of precision (P), recall (R), and F1-score across different LLMs and prompting strategies. The averages are calculated over three iterations. The best-performing approach is indicated in bold.
| LLM | PROMPT | CONCEPT P | CONCEPT R | CONCEPT F1 | TUPLE P | TUPLE R | TUPLE F1 | RELATION P | RELATION R | RELATION F1 |
|---|---|---|---|---|---|---|---|---|---|---|
| Micro | | | | | | | | | | |
| Qwen3 | CoT | 0.808 | 0.859 | 0.833 | 0.721 | 0.725 | 0.723 | 0.468 | 0.471 | 0.469 |
| Qwen3 | Role-CoT | 0.560 | 0.706 | 0.624 | 0.384 | 0.503 | 0.435 | 0.206 | 0.271 | 0.234 |
| Qwen3 | ReAct | 0.698 | 0.671 | 0.684 | 0.539 | 0.508 | 0.523 | 0.357 | 0.343 | 0.350 |
| Gemma 3 | CoT | 0.776 | 0.857 | 0.815 | 0.639 | 0.665 | 0.652 | 0.422 | 0.448 | 0.435 |
| Gemma 3 | Role-CoT | 0.571 | 0.782 | 0.660 | 0.379 | 0.497 | 0.430 | 0.150 | 0.237 | 0.184 |
| Gemma 3 | ReAct | 0.757 | 0.816 | 0.785 | 0.609 | 0.620 | 0.614 | 0.355 | 0.365 | 0.360 |
| LLaMA 3.1 | CoT | 0.186 | 0.074 | 0.106 | 0.215 | 0.109 | 0.145 | 0.167 | 0.085 | 0.112 |
| LLaMA 3.1 | Role-CoT | 0.503 | 0.362 | 0.421 | 0.444 | 0.294 | 0.354 | 0.260 | 0.179 | 0.212 |
| LLaMA 3.1 | ReAct | 0.393 | 0.205 | 0.269 | 0.391 | 0.217 | 0.279 | 0.265 | 0.147 | 0.189 |
| Macro | | | | | | | | | | |
| Qwen3 | CoT | 0.789 | 0.857 | 0.802 | 0.710 | 0.687 | 0.684 | 0.524 | 0.518 | 0.511 |
| Qwen3 | Role-CoT | 0.651 | 0.782 | 0.662 | 0.519 | 0.537 | 0.490 | 0.294 | 0.302 | 0.285 |
| Qwen3 | ReAct | 0.608 | 0.621 | 0.590 | 0.581 | 0.547 | 0.543 | 0.386 | 0.390 | 0.375 |
| Gemma 3 | CoT | 0.781 | 0.867 | 0.795 | 0.614 | 0.625 | 0.591 | 0.404 | 0.440 | 0.404 |
| Gemma 3 | Role-CoT | 0.604 | 0.788 | 0.644 | 0.467 | 0.526 | 0.457 | 0.173 | 0.264 | 0.189 |
| Gemma 3 | ReAct | 0.777 | 0.830 | 0.774 | 0.628 | 0.575 | 0.568 | 0.328 | 0.321 | 0.309 |
| LLaMA 3.1 | CoT | 0.103 | 0.112 | 0.095 | 0.227 | 0.207 | 0.212 | 0.176 | 0.176 | 0.176 |
| LLaMA 3.1 | Role-CoT | 0.325 | 0.357 | 0.326 | 0.382 | 0.327 | 0.340 | 0.223 | 0.224 | 0.218 |
| LLaMA 3.1 | ReAct | 0.192 | 0.195 | 0.179 | 0.329 | 0.288 | 0.300 | 0.260 | 0.239 | 0.246 |
Table 3
Results for relation extraction across different LLMs and prompting strategies. For each model–prompt configuration, we report the number of concepts |ℂ|, the total number of predicted tuples |ℙ|, the total number of triplets |𝕋|, the number of Iconclass notations with valid triplets (|ℕ|), the number without valid triplets, and the mean number of triplets per Iconclass notation, i.e., |𝕋|/|ℕ|.
| LLM | PROMPT | |ℂ| | |ℙ| | |𝕋| | |ℕ| | W/O TRIPLETS | |𝕋|/|ℕ| |
|---|---|---|---|---|---|---|---|
| Qwen3 | CoT | 241 | 162 | 164 | 71 | 14 | 2.3 |
| Qwen3 | Role-CoT | 302 | 223 | 226 | 78 | 7 | 2.9 |
| Qwen3 | ReAct | 204 | 140 | 143 | 58 | 27 | 2.5 |
| Gemma 3 | CoT | 256 | 182 | 186 | 83 | 2 | 2.2 |
| Gemma 3 | Role-CoT | 325 | 231 | 279 | 84 | 1 | 3.3 |
| Gemma 3 | ReAct | 247 | 176 | 178 | 81 | 4 | 2.2 |
| LLaMA 3.1 | CoT | 25 | 13 | 13 | 8 | 77 | 1.6 |
| LLaMA 3.1 | Role-CoT | 131 | 67 | 72 | 35 | 50 | 2.1 |
| LLaMA 3.1 | ReAct | 57 | 24 | 24 | 11 | 74 | 2.2 |
Comparison of human and LLM-based judge
As outlined in Section 4.1, identifying matches—i.e., semantically equivalent annotations—between ground truth and model predictions is crucial for computing performance metrics. Thus, we compared the micro and macro scores of the best-performing configuration—Qwen3 with CoT prompting—using assessments from both a human expert (denoted as H) and an LLM-based judge (denoted as LLM). For relation extraction, both the micro and macro F1-scores are higher under the human judge H than under the LLM-based judge. This suggests that the LLM-as-a-judge procedure is stricter in determining matches than the human annotator. Nevertheless, the moderate inter-coder agreement (Cohen’s κ = 0.455) indicates that our metrics still provide reliable insights into model performance.
4.3 Qualitative results
A qualitative comparison between the ground-truth and predicted triplets reveals several recurring patterns of divergence. Representative examples for selected Iconclass notations are summarized in Table 4. We observe the following failure cases: (1) Due to the absence of a standardized relational schema, predicates are often expressed in lexically or morphologically altered forms (e.g., rests beside vs. resting_beside). (2) We also encounter inversions or substitutions in the subject–object structure. In such cases, the model may reverse the direction of a relation or replace an entity with a more generic, more specific, or semantically adjacent one (e.g., replacing two gates with the more specific ivory gate and horn gate). (3) At times, the underlying LLMs invoke implicit or extratextual knowledge, introducing relations that exceed what is explicitly encoded in the Iconclass description—for instance, inferring that ⟨Perseus, kills, Medusa⟩. However, these additions never introduce objectively false information; rather, they reflect contextually plausible over-interpretation. For the best-performing configuration, no genuine extratextual hallucinations are observed. (4) Incomplete predictions occur when valid ground-truth triplets are omitted (such as the missing relation ⟨Mary, together_with, John the Baptist⟩). (5) Finally, a small subset of outputs exhibits syntactic or logical incoherence, producing malformed constructions such as ⟨Perseus, death_of, death⟩.
Table 4
Qualitative comparison of ground-truth and predicted triplets (represented in the form subject-predicate-object) for selected Iconclass notations N, with associated descriptions DN shown in parentheses.
| N | GT SUBJECT | GT PREDICATE | GT OBJECT | PRED. SUBJECT | PRED. PREDICATE | PRED. OBJECT |
|---|---|---|---|---|---|---|
| 93G (House of Sleep: a gloomy cave through which runs the river Lethe; possibly with two gates, […]) | | | | | | |
| | River Lethe | runs through | House of Sleep | House of Sleep | runs_through | river Lethe |
| | ivory gate | issues | deceptive dreams | ivory gate | issues | deceptive dreams |
| | horn gate | issues | true dreams | horn gate | issues | true dreams |
| | House of Sleep | contains | two gates | House of Sleep | has_gate | ivory gate |
| | | | | House of Sleep | has_gate | horn gate |
| 11G311 (Michael resting beside the dragon’s corpse) | | | | | | |
| | Michael | rests beside | dragon’s corpse | Michael | resting_beside | dragon’s corpse |
| | Dragon | represented as | corpse | | | |
| 11U4 (Mary and John the Baptist together with (e.g. kneeling before) the judging Christ, ’Deesis’ Last Judgement) | | | | | | |
| | Mary | kneels before | Christ | Mary | kneels_before | judging Christ |
| | John the Baptist | kneels before | Christ | John the Baptist | kneels_before | judging Christ |
| | Mary | together with | John the Baptist | | | |
| | Christ | represented as | Judge | | | |
| 29B (Plants behaving as human beings or animals) | | | | | | |
| | plants | behave like | human beings | plants | behaving_as | human beings |
| | plants | behave like | animals | plants | behaving_as | animals |
| 94P68 ((story of) Perseus – death) | | | | | | |
| | | | | Perseus | kills | Medusa |
With the exception of the latter two categories, however, these issues are largely mitigated in practice when operating over the complete Wikidata corpus. The extensive redundancy and ontological depth of Wikidata absorb many of the lexical and semantic variations observed here: differing predicate forms or reversed subject–object structures rarely entail information loss as long as the underlying entities and relations are represented within the knowledge graph.
4.4 Results for the Wikidata dataset
Using the best-performing configuration from the previous experiment—Qwen3 with CoT—we extract 223 271 concepts, 104 843 tuples, and 104 886 triplets from 6356 Iconclass notations associated with 24 614 art-historical objects in Wikidata. Of these 6356 notations, 778 lack textual descriptions, as certain Iconclass notations can be flexibly extended through bracketed text and therefore do not provide default descriptions. In a further 915 cases, the LLM output does not yield extractable information. Consequently, a total of 4663 Iconclass notations is successfully analyzed.
5 Implications/Applications
In this section, we move from the technical implementation of the hybrid retrieval pipeline to its methodological and epistemic implications for art-historical research. Building on the iART framework (Springstein et al., 2021; Schneider et al., 2022),18 we demonstrate how the integration of VLM-assisted retrieval with metadata-driven faceting enables forms of inquiry that neither approach could sustain in isolation. This highlights the reciprocity of machine-constructed and human-constructed knowledge representations: VLM-derived embeddings do not replace the interpretive affordances of structured metadata systems such as Wikidata, but rather extend them, rendering their categorical logic dynamically negotiable within multimodal environments. To demonstrate this potential, we outline four exemplary application scenarios that translate art-historical questions into a computationally tractable form:
Faceting by Time and Geography: In accordance with Aby Warburg’s notion of Bilderfahrzeuge—images or motifs that migrate across time and geography—the temporal and geographical facets in iART enable researchers to trace visual trajectories across large collections. Researchers can thus observe how particular motifs—say, the “Amor” or the “Pietà”—recur, transform, and disseminate within distinct cultural and historical contexts (Figure 2). The temporal facet reveals how iconographic foci emerge, intensify, and decline;19 the geographic facet, in turn, visualizes the circulation of these foci on a map. The result is an operationalization of the Warburgian method: temporal and geographic metadata become analytical coordinates that reveal the migration of forms (Warburg, 1907).
Faceting by Gender: Faceted retrieval also exposes how gendered concepts structure multimodal search. Through the “depicts” (P180) and “gender” (P21) properties in Wikidata, iART renders visible the asymmetries latent in a model’s embedding space. Queries such as “wisdom” and “beauty” highlight that VLMs reproduce entrenched cultural hierarchies: the former is predominantly male-coded, the latter female-coded (Figure 3). What is here ‘retrieved’ is less a set of images than a cultural logic of association—the masculinization of intellect and the feminization of aesthetic value. Such tendencies mirror well-documented biases in large-scale training corpora, where men are discursively linked to professionalism or intellect, and women to appearance or external traits (Schwemmer et al., 2020; Radford et al., 2021). The faceted interface makes these distributions empirically graspable.
Faceting by Materiality: Materiality—represented in Wikidata through the property “made from material” (P186)—introduces a further axis for investigating VLM-based search results. Filtering search results by material allows one to test whether such models reinforce the conventions of Western art history, privileging canonical media such as oil painting or marble sculpture over regionally specific or non-canonical materials. A search for “warrior” or “market”, for instance, yields a preponderance of oil-on-canvas works (Figure 4), implying that the model recognizes the motifs most readily within European painterly traditions. Such clustering may stem from Wikidata’s overrepresentation of digitized Western collections, or from the VLM’s greater facility with familiar visual idioms. In either case, the facet shows that data composition and model architecture jointly define the material conditions of algorithmic legibility.
Faceting by Semantics: A further application concerns the semantic dimension of faceted retrieval. Revisiting the example introduced by Impett & Offert (2022), we query iART using the conceptually adjacent terms “nude” and “naked.” As Impett & Offert (2022) similarly observed for the Art Institute of Chicago’s collection, iART’s Wikidata-based environment returns overlapping yet divergent corpora: both sets predominantly feature unclothed human figures, but the facet distributions reveal subtle shifts that index the underlying model’s visual ideology (Figure 5). Results for “nude” show higher counts for facets such as “woman” and “female breast,” aligning with Kenneth Clark’s conception of the nude as “balanced, prosperous, and confident” (Clark, 1956). Searches for “naked,” by contrast, yield greater association with “Jesus Christ” and related iconography of suffering and revelation—corresponding more closely to John Berger’s understanding of nakedness as a state of exposure and being seen (Berger, 1972). The distribution of facets thus becomes a diagnostic of how cultural semantics are visually instantiated; the contrast between “nude” and “naked” exemplifies how VLMs inherit and reconfigure art-historical conventions.

Figure 2
Search results in iART for the queries “Amor” (a) and “Pietà” (b), with the facets “time start” and “location” expanded.

Figure 3
Search results in iART for the queries “wisdom” (a) and “beauty” (b), with the facets “time start” and “depicts” expanded.

Figure 4
Search results in iART for the queries “warrior” (a) and “market” (b), with the facets “time start” and “material” expanded.

Figure 5
Search results in iART for the queries “nude” (a) and “naked” (b), with the facets “time start” and “depicts” expanded.
These scenarios demonstrate that extending Wikidata not only enhances retrieval but also reconfigures interpretation. By rendering temporal, spatial, gendered, and material dimensions explicitly navigable, multimodal systems like iART reveal how art-historical meaning-making is produced through acts of selection, emphasis, and omission—whether human or algorithmic. Such a hybrid environment exposes the inherited taxonomies through which art history has been digitized, while enabling their reconsideration within the epistemic space of the multimodal model.
Additional File
The additional file for this article can be found as follows:
Notes
[1] For example, searching for “Leonardo” retrieves objects that are explicitly tagged with “Leonardo” in the metadata field “creator.”
[2] https://imgs.ai, last accessed on December 28, 2025.
[3] https://artsandculture.google.com, last accessed on December 28, 2025.
[4] https://www.wikidata.org/wiki/Wikidata:Main_Page, last accessed on December 28, 2025.
[5] See https://www.wikidata.org/wiki/Q2529962, last accessed on December 28, 2025.
[6] https://www.iart.vision/beta/search, last accessed on December 28, 2025.
[7] https://query.wikidata.org, last accessed on December 28, 2025.
[9] If “inception” is not available, we fall back to the properties “start time” (P580), “end time” (P582), and “publication date” (P577) to obtain time-related creation information.
[10] See https://www.wikidata.org/wiki/Q2529962, last accessed on December 28, 2025.
[11] Note that we also instruct the LLM to output related concepts without relations as an intermediate result. The complete prompt templates are provided in Appendix B.
[12] https://qdrant.tech/, last accessed on December 28, 2025.
[13] https://www.iart.vision/beta/search, last accessed on December 28, 2025.
[14] https://vuejs.org, last accessed on December 28, 2025.
[15] See, e.g., the Digital Collection of the Städel in Frankfurt (https://sammlung.staedelmuseum.de/en, last accessed on December 28, 2025).
[16] Note that the order of concepts within the pairs ℙN and ℙNGT is considered interchangeable, as no directional relation is assumed for the task.
[17] https://ollama.com/, last accessed on December 28, 2025.
[18] https://www.iart.vision/beta/search, last accessed on December 28, 2025.
[19] While such patterns partially reflect the composition of the underlying dataset, they also suggest broader shifts in visual production and reception.
[20] https://gepris.dfg.de/gepris/projekt/510048106, last accessed on December 28, 2025.
Competing interests
The authors have no competing interests to declare.
Funding statement
These works are part of the project “Reflection-driven Artificial Intelligence in Art History,” which has been funded by the DFG since 2022 under project number 510048106 within the Priority Program The Digital Image.20 The collaboration between the TIB – Leibniz Information Centre for Science and Technology and the Chair of Medieval and Modern Art History at Ludwig Maximilian University of Munich adopts an interdisciplinary approach to the challenges of applying AI to art-historical image search and analysis. It seeks to identify and address the specific requirements of art history for the reflective and methodologically sound use of AI methods in historical research processes.
Author Contributions
Stefanie Schneider: Conceptualization, Data Curation, Formal Analysis, Investigation, Methodology, Software, Project Administration, Writing – original draft; Matthias Springstein: Methodology, Software, Writing – original draft; Julian Stalter: Investigation, Project Administration, Writing – original draft; Daniel Ritter: Formal Analysis, Methodology, Writing – original draft; Ralph Ewerth: Project Administration, Writing – original draft; Eric Müller-Budack: Formal Analysis, Investigation, Methodology, Project Administration, Writing – original draft.
