Abstract
Vision-Language Models (VLMs) are increasingly employed in art-historical research and teaching infrastructures to enhance the discovery of images in large visual collections. Unlike traditional metadata-based systems, where retrieval depends on human-curated textual descriptors, VLM-based approaches enable content-aware searches that directly compare visual structures. However, VLMs also embed culturally situated biases, compressing visual phenomena into opaque statistical representations. For art history, this tension is particularly significant: while VLMs facilitate the detection of visual motifs beyond categorical restrictions, their interpretability and verifiability remain limited compared to metadata-based systems, which integrate centuries of scholarly and contextual expertise. In this paper, we argue that VLM-driven approaches should augment, not replace, metadata-based infrastructures. We present a hybrid retrieval pipeline that integrates VLM-derived embeddings with structured metadata from Wikidata, using faceting mechanisms to organize and navigate multimodal results. By additionally deriving triplet-based assertions about depicted entities and linking them to existing metadata, our approach enhances both relevance and transparency in art-historical search. Implemented within a retrieval environment, this system exposes cultural and epistemic biases in both datasets and models, contributing to a reflective framework for applying Artificial Intelligence (AI) in the humanities.
