
The FORBIN Dataset: A Collection of Historical Photographs With Archival Metadata

Open Access | Apr 2026


1 Introduction

The visual archives of early news picture agencies span a wide range of states of systematisation, from unstructured, undigitized collections to well-structured digitized databases. All of them, however, are characterised by incomplete metadata. Most of these archives lack primary sources detailing the editorial activities, authorship, distribution and retouching of images. The contextual analysis of such large historical visual archives therefore presents significant obstacles to researchers. Since no expert can single-handedly review such an overwhelming number of prints, conventional historiographical approaches fail to capture the visual literacies and cultures created by the circulation of these images on an unprecedented global scale. Their many transformations and trajectories into printed photo engravings and illustrations are hardly taken into account at scale and across entire sets. By contrast, artificial intelligence (AI) and computer vision (CV) methods make it possible to explore large quantities of data and to solve tasks such as text recognition, image similarity estimation, and image description (He et al., 2018; Radford et al., 2021). However, the performance of these methods depends heavily on the availability of high-quality, domain-specific annotated datasets.

Advancements in data-driven approaches, such as the deep learning techniques used for document and image analysis, have been supported by the increased availability of extensive datasets, enhanced storage capacity, and expanding computational resources. Many datasets have been created for specific tasks. Well-known examples are the ICDAR (International Conference on Document Analysis and Recognition) benchmark competitions, specifically designed to evaluate and advance document reading systems (Antonacopoulos et al., 2007; Simistira et al., 2017), and COCO-Text (Veit et al., 2016), which has significantly contributed to progress in scene text detection and recognition. In the field of digital humanities, other datasets have been released, such as READ for handwritten manuscripts (Gruning et al., 2018) or the FINLAM dataset (Studer et al., 2023), collected from digitized documents in the annotated collections of the BnF (Bibliothèque nationale de France). These resources are valuable benchmarks, but they mainly focus on modern images or on textual heritage such as manuscripts and books. Several datasets are dedicated to historical photographs, but few include large collections of news pictures (Smits et al., 2025).

Moreover, existing resources face several challenges when applied to historical visual archives. These data are often heterogeneous and degraded: they include halftone prints, handwritten captions, or aged paper textures that differ significantly from the clean, modern data on which most AI models have been trained (Colavizza et al., 2022; Wu et al., 2023). Figure 1 shows two examples of digitized historical photographs, extracted from the EyCon dataset (Giardinetti et al., 2024) and the proposed Forbin dataset. The contrast and format of these two photographs differ: one comes from an amateur album whose author provided photographs to the press, while the other is a professional photograph, ready for publication, with well-preserved contrast. This domain gap complicates transfer learning and requires adaptation techniques to bridge the visual and material differences. Many existing datasets also ignore the archival and contextual aspects of images, such as their origin, use, or physical annotations, which are essential for humanities research (Ehrmann et al., 2022). Recent initiatives such as Human Machine Culture (Colavizza et al., 2022), NewsEye (Ehrmann et al., 2022), EyCon (Giardinetti et al., 2024) and Impresso (Lo et al., 2023) have demonstrated the potential of interdisciplinary collaboration between computer scientists, historians, and archivists, highlighting the need to design datasets that combine computational utility with historical accuracy (Arnold & Tilton, 2023; Wu et al., 2023).

Figure 1

Illustration of the visual heterogeneity across heritage datasets. The EyCon photograph (Giardinetti et al., 2024) shows overlapping prints, typical of archival contact sheets, while the Forbin image presents a caption stitched to the lower margin on the recto and stamps on the verso. Variations in contrast and tone result from differences in material supports and digitization workflows.

In this paper, we introduce the historical Forbin dataset, which contains a unique collection of photographs assembled by the French journalist and photographer Victor Forbin (1864–1947). Victor Forbin started his career as a journalist in the 1880s and published in major French periodicals such as the Excelsior, the Journal des Voyages, the Revue des deux mondes and L’Illustration. Forbin’s artisanal approach to his photographic trade – a symptom of the still unstructured economy of news images before the late 1920s – explains why the collection is organized according to an ad hoc classification system (both geographical and thematic) of his own design. He eventually developed his own news picture agency in the early 20th century, whose activity peaked in the 1910s and 1920s. Forbin bought prints from London, Berlin, Rome and New York and sold them to some of the most popular French newspapers of the time. Due to their sheer volume, these prints are impossible to investigate manually at scale, and their metadata are sparse and limited. Compared to other datasets, the Forbin collection is largely uncurated, with little metadata and a wide variety of photographic formats. The dataset is composed of recto–verso image pairs, providing both the front and back sides of each photograph. While the recto typically contains the visual content, the verso often includes textual information such as captions, stamps, or press-edition annotations. Both sides can be exploited, after digitization, for metadata extraction and document analysis (Antonacopoulos et al., 2007).

The Forbin dataset not only offers visual information, but it also includes details on image archiving, geographical data, and thematic classification. This contextual layer offers valuable insights into the history of early news photographs and mass visual culture. By combining photographic material with a diverse set of annotated elements on the verso of the images, ranging from handwritten and typewritten notes to stamps and signatures, the dataset supports a broad range of research directions in document analysis (Fischer et al., 2010; Gruning et al., 2018), image retrieval (Efthymiadis et al., 2025), multimodal understanding, and digital humanities (Du et al., 2024).

The remainder of this article is organized as follows. Section 2 provides an overview of existing datasets and related work. Section 3 summarizes the dataset description (repository, formats, creators, and license). Section 4 introduces the Forbin dataset, detailing the historical photographs and associated metadata. Section 5 describes the distribution format of the images, metadata and annotations. Section 6 presents potential research tasks and concrete use cases enabled by the dataset. Section 7 discusses the scope and contribution of the dataset. Finally, conclusions and perspectives are summarized in Section 8.

2 Related datasets

The progression of AI models in computer vision largely depends on the availability of extensive public datasets used for training, testing, and evaluation. In the context of this study, these datasets can be broadly grouped into three main categories: (i) generic datasets, (ii) datasets designed for text detection and recognition, and (iii) patrimonial datasets for historical visual content analysis. Figure 2 illustrates representative datasets from each category. We also indicate the usage and the size for each dataset. In the following subsections, we present these datasets and highlight their respective characteristics and limitations.

Figure 2

Illustration that compares the scale of the proposed Forbin dataset with a variety of existing benchmarks. Each dataset is represented in blue for monomodal data and in red for multimodal data, positioned according to its targeted tasks on the vertical axis and its total number of samples on the horizontal axis.

2.1 General vision and multimodal benchmarks

The well-known ImageNet dataset (Krizhevsky et al., 2012) represents a cornerstone for deep learning approaches in computer vision. It contains more than 14M labeled images organized into 20,000 categories, and it has been a decisive benchmark for advancing classification methods in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC).

For object detection and segmentation, the Microsoft COCO dataset (Lin et al., 2014) has become a standard reference. It includes over 330,000 images, of which 200,000 are labeled, with 1.5M object instances spanning 80 categories. Its dense annotations and complex scenes make it particularly suited for benchmarking object detection and segmentation systems.

More recently, the LAION-5B dataset (Schuhmann et al., 2022), a large-scale multimodal resource composed of 5.85 billion image–text pairs harvested from the internet, has enabled the training of large vision–language models such as CLIP (Radford et al., 2021) and generative models such as Stable Diffusion (Zhang et al., 2025). Datasets of this type are used to train AI models with weak or sparse annotations, which is the current trend in computer vision. Rather than relying on exhaustive manual labeling, they leverage the sheer volume of data and multimodal signals to provide indirect supervision. This paradigm allows models to learn rich visual and semantic representations from sparsely annotated data, where the scale and diversity of the corpus compensate for the low density of explicit annotations.

While these datasets are not primarily designed for text analysis, their scale and visual diversity provide indirect benefits to text detection and recognition in complex real-world scenes. However, they predominantly feature contemporary objects and environments, making them temporally and semantically distant from historical collections such as the one presented here.

2.2 Datasets for text and document understanding

Text detection and recognition is one of the major challenges in computer vision. To foster progress in this field, the ICDAR consortium has organized several benchmark competitions addressing layout analysis, text detection and recognition tasks. The ICDAR2007 Page Segmentation Competition (Antonacopoulos et al., 2007) focused primarily on document layout analysis, providing around 1,000 document pages with detailed ground-truth segmentation for structural understanding of printed materials. The ICDAR2017 Robust Document Layout Challenge (RDCL2017) (Clausner et al., 2017) provided more than 2,000 images with complex layouts, including newspapers, magazines, and structured documents, where image segmentation and text recognition were jointly evaluated.

Beyond these, other benchmarks have also played a central role. The COCO-Text dataset (Veit et al., 2016), derived from the Microsoft COCO collection, contains over 63,000 natural images and more than 173,000 text annotations spanning machine-printed and handwritten text. The SynthText dataset (Gupta et al., 2016) contains 800,000 synthetic images with realistic scene text rendering and is widely used for pretraining detection and recognition models. The Street View Text (SVT) dataset (Wang et al., 2011) includes 350 images collected from Google Street View, focusing on text in urban environments.

Overall, these datasets, spanning printed, handwritten and scene text, have become standard references for evaluating text detection and recognition in natural images and complex layouts, supporting the development of increasingly robust segmentation and recognition algorithms.

2.3 Cultural and historical archives

In the field of digital humanities, several datasets specifically target historical documents. The READ project (Gruning et al., 2018) provided over 400,000 annotated manuscript pages for Handwritten Text Recognition (HTR). The IMPACT project (Papadopoulos et al., 2013) produced more than 20,000 digitized early printed books across 25 languages, supporting large-scale Optical Character Recognition (OCR). The ICFHR2018 competition (Strauß et al., 2018) exploited the READ corpus to benchmark transcription systems, involving several thousand handwritten pages. More recently, FINLAM (Studer et al., 2023) introduced a corpus of Finnish historical newspapers with 161,000 pages and multimodal annotations, including layout, text, and named entities. In the same domain, Gutehrlé and Atanassova (2021) proposed a dataset for logical layout analysis of historical newspapers, designed to support structural segmentation and content extraction tasks. Complementarily, the Newspaper Navigator dataset (Lee et al., 2020) extracted textual and visual content from over 16M historic newspaper pages in the Chronicling America collection, enabling large-scale retrieval and visual content analysis.

Beyond textual transcription, other initiatives address complementary challenges. The British Library’s IIIF collections host more than 1M digitized archival items with metadata on provenance and circulation, while Europeana aggregates over 50M cultural heritage objects, including manuscripts, photographs, artworks, newspapers, and audiovisual materials, from institutions across Europe. Other datasets focus on image description and region-level annotation, supporting captioning and semantic alignment tasks. For example, Visual Genome (Krishna et al., 2017) contains 108,000 images with 5.4M region descriptions and 1.7M question–answer pairs. Another example is Flickr30k Entities (Plummer et al., 2015), composed of 31,000 images and 244,000 region–phrase correspondences. In the same context, the Bain Collection (Arnold & Tilton, 2023), comprising over 40,000 historical press photographs from the early 20th century, aims to preserve and facilitate the study of early photojournalism, serving as a resource for image classification, caption analysis, and historical media research. Similarly, the Finnish WWII Photographs dataset (Chumachenko et al., 2020), containing around 160,000 wartime images from the Finnish Defence Forces, was created to document and analyze the visual history of Finland during World War II, supporting tasks such as scene classification, object detection, and metadata enrichment in historical contexts.

Recently, the EyCon dataset (Giardinetti et al., 2024) has extended this line of research by providing a visual corpus of early conflict photography, composed of over 130,000 digitized images documenting conflicts between the 1890s and 1918 across various theatres of war, including colonial campaigns and the First World War. This dataset enables analysis of historical propaganda and the social circulation of conflict imagery. The FAIT Photo initiative (van Wissen et al., 2025) goes further by transforming a collection of two million historical press photographs into a FAIR-compliant (Findable, Accessible, Interoperable, Reusable) linked-data resource, promoting large-scale visual and semantic exploration of press archives. Finally, the Datasheet for Digital Cultural Heritage datasets (Alkemade et al., 2023) proposes a methodological framework to improve the transparency, documentation, and reusability of heritage datasets, offering metadata templates and best practices to ensure the ethical and interoperable publication of cultural data.

These resources illustrate the diversity of available corpora, ranging from large-scale handwritten and printed text datasets to multi-modal newspaper archives. Thanks to the effort of many contributors and different research communities, they are richly annotated for descriptive purposes and circulation studies.

2.4 Limitations of existing datasets

In order to compare the presented datasets, Figure 2 provides an overview of the discussed resources, highlighting their target tasks, scale, and degree of modality (mono- or multimodal). This comparison also underlines their respective limitations in terms of scope, annotation depth, and contextual richness. Generic benchmarks such as ImageNet, MS-COCO or LAION-5B focus mostly on classical computer vision tasks such as image classification, segmentation or multimodal representation, often with weak or automatic supervision, while datasets such as READ (Gruning et al., 2018), IMPACT (Papadopoulos et al., 2013) and FINLAM (Studer et al., 2023) focus on a single modality, e.g., text, and lack photographic content or recto–verso pairs. Other benchmarks like ICDAR (Antonacopoulos et al., 2007; Clausner et al., 2017; Simistira et al., 2017) and COCO-Text (Veit et al., 2016) consist primarily of modern images with synthetic text, providing little information about archival context. Finally, datasets such as the Bain Collection (Arnold & Tilton, 2023), EyCon (Giardinetti et al., 2024) and the Finnish WWII Photographs (Chumachenko et al., 2020) include historical photographs with limited metadata, but they remain comparatively restricted in scale, diversity, or annotation richness. To facilitate a clear comparison, Table 1 summarizes the sizes and tasks covered by cultural and historical archives.

Table 1

Comparison of cultural and historical datasets according to covered tasks. While most existing collections focus on a single objective, such as handwritten text recognition, document layout analysis, or photo archiving, the Forbin dataset stands out by jointly addressing multiple complementary dimensions. It combines annotated historical photographs with metadata, textual content (both printed and handwritten), and scene text annotations, offering a unified resource for multimodal analysis of historical visual materials.

Dataset | Size | Handwritten text recog. | Metadata | Histo. photos | OCR & scene text | Layout & doc. analys.
READ | 400K | ✓ | | | |
IMPACT | 20K | | | | ✓ |
ICFHR2018 | 3K | ✓ | | | |
FINLAM | 161K | | | | ✓ | ✓
Newspaper Navigator | 16M | | | ✓ | ✓ |
Bain Collection | 40K | | | ✓ | |
Finnish WWII | 160K | | | ✓ | |
EyCon | 130K | | | ✓ | |
Forbin dataset | 120K | ✓ | ✓ | ✓ | ✓ | ✓

Critical perspectives on dataset genealogy (Denton et al., 2021) highlight that datasets used in machine learning are socio-technical constructions that embed assumptions about knowledge, labor, and context. Many existing corpora prioritize scale and compatibility over historical or archival provenance, treating data as decontextualized visual facts.

3 Dataset description

Repository location

https://doi.org/10.57967/hf/7189

Repository name

HuggingFace

Object name

Forbin Dataset

Format names and versions

JPG and JSON.

Creation dates

2024-11-19 to 2025-12-09.

Dataset creators

Mohamed Chelali (post-doctoral researcher, CNRS, Université Paris Cité), Sylvain-Karl Gosselet (researcher, CNRS, Université Paris Cité), Florence Cloppet (professor, LIPADE, Université Paris Cité), Camille Kurtz (professor, LIPADE, Université Paris Cité), Isabelle Bloch (professor, LIP6, Sorbonne Université, France), Daniel Foliard (professor, ECHELLES, Université Paris Cité).

Language

French and English

License

Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)

Publication date

2025-12-09.

4 The Forbin dataset

In this section, we introduce the Forbin dataset, a novel resource containing more than 60,000 historical photographs, digitized on both sides (over 120,000 images), and specifically designed to support research in computer vision and digital humanities. Unlike existing benchmarks that usually focus on either text or visual content in isolation, the Forbin dataset offers a unique combination of digitized recto–verso photographic documents, dating from the late 19th to the early 20th century, enriched with archival metadata. Figure 3 presents the construction process of the dataset and some samples that highlight its content. The following subsections describe in detail the collection process, the composition of the dataset together with the adopted annotation strategy, and the research challenges it enables.

Figure 3

Overview of the construction workflow of the Forbin dataset and selected samples. (a) Photographs were extracted from archival boxes, scanned, and digitized. Metadata related to provenance and digitization conditions were collected during this process. (b) Examples of digitized samples illustrating the visual diversity and quality of the data. The recto side may feature original prints or retouched photographs ready for layout design and publication, while the verso side often contains handwritten notes, captions in multiple languages, stamps, and editorial marks.

4.1 Data collection

The Forbin dataset is built from digitized photographs archived by the Service Historique de la Défense in France. The preservation structure used for this collection is that of its original owner: the photographs are grouped by theme in envelopes, and the envelopes are sorted into boxes by geographical area. The collection comprises 260 boxes with approximately 240 photographs per box.

The collection was digitized by Azentis, a French private company specialising in the handling of archival material. Since the original materials span various photographic formats, including printed photographs, glass plates, and both positive and negative films, Azentis selected two professional camera systems to ensure optimal quality. For opaque documents, digitization was performed with a Canon 5DSR digital camera (50 MP, full-frame CMOS sensor, 5792 × 8688 pixels) equipped with Canon 70 mm f/2.8 DG MACRO and 50 mm f/1.4 DG HSM lenses. For celluloid film and glass plates requiring higher precision, a Fujifilm GFX 100 S medium-format camera (102 MP, stabilized CMOS II sensor) was used, paired with GF 120 mm f/4 Macro, GF 80 mm f/1.7, and GF 50 mm f/3.5 lenses.

Both setups relied on flat digitization tables with removable glass and a Kaiser Executive LED Illumina Base Plate 5215 light table for transparent media. Illumination was provided by Godox AD300pro cold-flash units with diffusers, complemented by color masks (black, gray, beige) to ensure uniform lighting and minimize reflections.

After digitization, 256 of the 260 boxes were retained, providing a total of 120,106 photographs. Four boxes were not digitized for this dataset: two were unavailable and two had been digitized previously but are not included here. Each photograph was digitized on both sides, resulting in 57,971 recto–verso pairs. The collection also contains 4,146 samples of negative and positive film labeled as recto only and 7 labeled as verso only, mainly corresponding to unidentifiable or administrative scans.

A dedicated post-processing stage was applied to ensure high-quality and visually consistent digital images. This process involved color calibration, adjustment of contrast and brightness levels, and precise cropping to optimize framing. During the digitization process, archival information was systematically recorded to preserve the original organizational structure and to associate metadata with each photograph. The digitized photographs of each box are stored in the same folder in order to preserve the hierarchical arrangement of the physical archive. Each image follows a consistent naming convention:

SHDGR_2_K_247_<BoxNum>_<ImgNum>_<idx>

In this naming convention, SHDGR_2_K_247 identifies the service and archival series, <BoxNum> designates the box identifier, <ImgNum> specifies the photograph index within the box, and <idx> takes the value 0 or 1, denoting the recto and verso sides, respectively. In some cases, the verso was scanned twice to capture information on a piece of paper glued onto it, resulting in an incremented <idx>. To complement this system, a tabular file was produced, gathering all archival metadata in a single structured resource that links directly to the digitized images. Figure 3a illustrates the collection process of the metadata.
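To make this convention concrete, the following minimal Python sketch parses a file stem into its components. It is an illustration only: the exact zero-padding of <BoxNum> and <ImgNum> in the released files is an assumption, and the example stem is hypothetical.

```python
# Minimal sketch: split a Forbin file stem into box number, photo index and side.
# Assumption: <BoxNum>, <ImgNum> and <idx> are purely numeric fields.
import re

PATTERN = re.compile(r"^SHDGR_2_K_247_(?P<box>\d+)_(?P<img>\d+)_(?P<idx>\d+)$")

def parse_forbin_name(stem: str) -> dict:
    m = PATTERN.match(stem)
    if m is None:
        raise ValueError(f"Unexpected file name: {stem}")
    return {
        "box": int(m.group("box")),
        "image": int(m.group("img")),
        # 0 = recto; 1 (or higher, for re-scanned versos) = verso
        "side": "recto" if m.group("idx") == "0" else "verso",
    }

print(parse_forbin_name("SHDGR_2_K_247_012_0042_1"))  # hypothetical stem
# -> {'box': 12, 'image': 42, 'side': 'verso'}
```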

4.2 Data composition

The Forbin dataset is composed of 120,106 photographs and metadata. The digitization process was carried out at a resolution of 300 DPI, which corresponds to a physical pixel size of approximately 0.085 mm. The size of these photographs ranges from 399 × 411 to 8036 × 8600 pixels, corresponding approximately to physical sizes between 3.4 × 3.5 cm for the smallest and 68.0 × 72.8 cm for the largest photographs. On average, the digitized photographs measure about 23.1 × 20.0 cm, which is close to the format of a medium-sized historical print. This wide variability reflects both the diversity of the original photographic materials and the characteristics of the scanning devices used for digitization. Figure 4 illustrates three examples of recto-verso images.
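As a quick check of these figures, the snippet below converts pixel dimensions at the stated 300 DPI into centimeters (1 inch = 2.54 cm) and reproduces the physical sizes reported above.

```python
# Convert pixel dimensions of a 300 DPI scan into centimeters.
DPI = 300
CM_PER_INCH = 2.54

def pixels_to_cm(px: int, dpi: int = DPI) -> float:
    return px / dpi * CM_PER_INCH

# Smallest and largest digitized photographs reported above:
print(round(pixels_to_cm(399), 1), round(pixels_to_cm(411), 1))    # ~3.4 x 3.5 cm
print(round(pixels_to_cm(8036), 1), round(pixels_to_cm(8600), 1))  # ~68.0 x 72.8 cm
```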

Figure 4

Three pairs of images from the proposed Forbin dataset, showing both the recto and verso sides (first and second rows), together with their associated metadata (bottom row). The metadata include the image identifier, archival box name, and country of origin. The continent field results from a manual classification into continental or thematic groups when the geographical origin is unknown, while the cluster and cluster name are derived from the HDBSCAN clustering applied to the class attributes across the entire dataset.

Adding to this collection, each image is accompanied by a structured set of archival metadata, extracted during the digitization process:

  • Class: the thematic context of the photograph (e.g., military, geography, culture);

  • Continent: geographical region corresponding to the photograph subject;

  • Subject/Country: original subject label or country name;

  • Real Size (cm): physical size of the original photograph;

  • Size (pixel): size of the digitized image.

The raw metadata extracted from the archives were initially heterogeneous and incomplete. The metadata are written in French and have been preserved in their original form. To improve coherence, we classified the country field into “Continent” and “Thematic” categories. This normalization was performed by manually constructing a controlled dictionary to prevent misclassification caused by historical or obsolete city names. The second step focuses on the analysis of the thematic fields. Initially, the dataset contains 674 distinct class labels, many of which are semantically redundant. For example, “industrie Panama”, “industrie infrastructure” and “industrie agriculture” all describe linked concepts. To organize these labels, we used a semantic clustering method: each class name was encoded into a BERT embedding, after which HDBSCAN clustering (Campello et al., 2013) was applied to group them semantically. HDBSCAN is a hierarchical density-based clustering method that groups data points by local density while automatically determining the number of clusters and handling noise. The minimum cluster size was set to three, and noise points were labeled as the ‘Other’ class. For the remaining classes, we merged clusters whose centroids had a cosine similarity greater than 95%, ensuring that semantically close clusters were unified. Each cluster was then labeled by its most representative term, chosen as the member closest to the centroid. Finally, the cluster labels derived from this analysis were applied to the entire dataset, yielding 30 coherent thematic groups such as “Vie militaire”, “Paysage, Architecture”, or “Industrie”. Figure 4 presents the metadata of the three examples, and Figure 5 illustrates, with pie charts, the proportion of each class obtained for each continent, as well as for the polar regions and islands.
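The following Python sketch illustrates this label-normalization step; it is not the authors’ exact pipeline. A sentence-transformers model stands in for the BERT encoder, the helper load_raw_labels() is hypothetical, and the merging of clusters whose centroids exceed 95% cosine similarity is omitted for brevity.

```python
# Sketch of semantic clustering of thematic class labels (illustrative, not the authors' code).
import numpy as np
import hdbscan
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_distances

raw_labels = load_raw_labels()  # hypothetical helper returning the 674 raw class strings

# Encode each class name; L2-normalized embeddings make cosine similarity a dot product.
encoder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in for the BERT encoder
emb = encoder.encode(raw_labels, normalize_embeddings=True)

# Density-based clustering on pairwise cosine distances, minimum cluster size of three.
dist = cosine_distances(emb).astype(np.float64)
labels = hdbscan.HDBSCAN(min_cluster_size=3, metric="precomputed").fit_predict(dist)

# Noise points (label -1) form the 'Other' class; every other cluster is named after
# the member closest to its centroid.
cluster_names = {-1: "Other"}
for c in sorted(set(labels) - {-1}):
    idx = np.where(labels == c)[0]
    centroid = emb[idx].mean(axis=0, keepdims=True)
    cluster_names[c] = raw_labels[idx[np.argmin(cosine_distances(centroid, emb[idx])[0])]]
```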

Figure 5

Distribution of the 30 semantic classes across continents. The pie charts reveal a marked thematic imbalance, with “Military life” dominating in most regions except Africa. This visualization underscores regional variations in subject matter within the Forbin dataset. The complete list of classes and their corresponding color codes is provided below.

4.3 Data annotation

In addition to the digitization and metadata collection, a subset of the Forbin dataset has been manually annotated. These annotations primarily concern the verso side of the photographs, which often contains textual elements such as titles, captions, and press stamps. They were produced using the Arkindex platform, a solution developed by Teklia.

The annotation process focused on identifying and transcribing specific text regions, typically at the block level. Blocks represent textual regions that convey a coherent semantic unit, such as a title or a caption, rather than fine-grained segments at the character or word level. Each text block is annotated using a polygon defined by at least four points, allowing the contour to closely follow the shape and orientation of the text rather than imposing a rigid rectangular box. In total, seven distinct text categories are used to label these text regions. A super-categorization step is then applied to merge labels whose semantic proximity indicates that they belong to the same category. Furthermore, some samples from the annotated subset are enriched with external bibliographic links to historical journals when identifiable. Several photographs are cross-referenced with their occurrences in the Gallica digital library of the Bibliothèque nationale de France, thereby aligning the Forbin annotations with existing archival metadata and enhancing their interoperability for cultural heritage research.

The manual annotation campaign covered 687 verso images, representing about 1.2% of the verso images in the Forbin dataset. Each annotation was carried out by trained operators and subsequently validated by archivists to ensure accuracy and consistency. With at least two text regions per image, the annotated corpus amounts to 1,767 individual annotations. Figure 6 illustrates the distribution of annotation categories, where bars of the same color denote shared super-categories.

Figure 6

Number of instances per annotation category. Bars sharing the same color belong to the same super-category.

To illustrate the variety of our annotations, Figure 7 presents six representative examples that highlight the diversity of the dataset content. The first column shows two images containing only a small amount of text, such as stamps and short handwritten notes. The second column features a typewritten document combined with stamps. Finally, the last column presents a letter; the second image in this column includes a legend, Forbin’s signature, and a stamp. The remaining text in this image is not annotated, as it is editorial information intended for publication, but it can be used to identify published photographs.

Figure 7

Annotated photographs from the Forbin dataset. Each image displays polygons corresponding to different annotation categories, with transcribed text linked to each region. The examples highlight the diversity of text layouts, orientations, and typographies captured in the dataset.

5 Data format and accessibility

The Forbin dataset is structured in a modified COCO-style JSON format (Lin et al., 2014). This choice preserves compatibility with common computer vision pipelines while incorporating domain-specific metadata relevant to historical archives. Two JSON files are published: the first contains all pairs of photographs with their metadata and annotations, and the second focuses only on the annotated subset [7]. Each JSON file contains two main components:

  • Image entries, describing each pair with a unique “id”, dual-face file references (“recto” and “verso”), and a rich metadata dictionary. The metadata include both technical properties (e.g., digital size, and physical size in inches and centimeters) and archival descriptors such as storage unit identifiers, geographical or historical context and classification tags. This combination ensures both traceability within the physical archive and interoperability with digital repositories;

  • Annotation entries, linking each annotated region to its corresponding image via “image_id”. Each annotation includes the bounding box, polygon segmentation and geometric statistics. In addition, textual metadata enrich the annotation: the field “text” stores the transcribed content, “orientation” indicates the reading direction and “text_type” specifies the semantic role of the text zone (e.g., caption, title, signature).

This extended COCO structure is selected to ensure flexibility for downstream tasks such as text detection, transcription, and layout analysis, while maintaining a clear linkage between visual, textual, and archival metadata.
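As an illustration, the Python sketch below reads such a file and joins annotations to their images. The file name and the top-level keys "images" and "annotations" (the usual COCO conventions) are assumptions; only the field names quoted above ("id", "recto", "verso", "image_id", "text", "orientation", "text_type") come from the dataset description.

```python
# Sketch of reading the extended COCO-style JSON and joining annotations to image entries.
import json

with open("forbin_annotated_subset.json", encoding="utf-8") as f:  # hypothetical file name
    data = json.load(f)

images = {img["id"]: img for img in data["images"]}  # assumed top-level key
for ann in data["annotations"]:                      # assumed top-level key
    img = images[ann["image_id"]]
    print(img["recto"], img["verso"], ann["text_type"], ann["orientation"], ann["text"])
```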

6 Tasks and potential uses

The Forbin dataset supports a wide range of research questions at the intersection of computer vision, document analysis and digital humanities. Its multimodal structure and archival context enable work ranging from low-level recognition tasks to large-scale cultural analysis. Some of these directions are briefly described below, without being exhaustive:

  • Text and graphic element analysis. The presence of handwritten and typewritten texts, stamps and other graphic elements on the verso of photographs provides challenging material for text recognition in degraded historical documents, graphic symbol detection, and layout analysis (Fischer et al., 2010; Gruning et al., 2018). The coexistence of textual and non-textual elements further supports research on structured document parsing and the identification of visual cues related to authorship, provenance, and circulation. The dataset is particularly suitable for fine-tuning and evaluating modern document understanding architectures and toolkits such as LayoutParser;

  • Multimodal description and metadata enrichment. The joint availability of photographic content, textual annotations and structured archival metadata supports cross-modal alignment between images and texts in an automated or semi-automated strategy. This supports caption generation, thematic or geographic classification and visual-textual alignment using contrastive multimodal models (e.g. CLIP-like). Such approaches may contribute to metadata enrichment and assist archivists in the large scale structuring and indexing of historical collections;

  • Image circulation and reuse in the press. By combining visual similarity measures with archival metadata, the dataset allows for the identification of reused or republished photographs across different contexts (Efthymiadis et al., 2025); a minimal sketch of this idea is given after this list. Such analyses enable the reconstruction of photographic dissemination networks, the study of editorial practices, and the tracing of thematic or geographic distributions across time (Smits & Ros, 2023);

  • Preservation-oriented and degradation analysis. The dataset contains numerous examples of aging-related degradations, including faded ink, stains, bleed-through and other material alterations. This makes it suitable for research on degradation detection, robustness evaluation of recognition systems under historical noise, and computational approaches supporting long-term digital preservation strategies (Aissi et al., 2025);

  • Large-scale exploration of visual culture. The dataset supports quantitative studies of visual themes, editorial choices and the geographic distributions of news photography. Such analyses open new and rich perspectives on mass visual culture and media history.
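As a minimal sketch of the visual-similarity idea behind the image circulation and reuse task above, the snippet below compares two digitized images with a generic CLIP image encoder. This is an illustration rather than the method used in the project, and both file names are hypothetical.

```python
# Sketch: near-duplicate / reuse detection via CLIP image embeddings (illustrative only).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed(path: str) -> torch.Tensor:
    """Return an L2-normalized CLIP embedding of the image at `path`."""
    inputs = processor(images=Image.open(path).convert("RGB"), return_tensors="pt")
    with torch.no_grad():
        feat = model.get_image_features(**inputs)
    return feat / feat.norm(dim=-1, keepdim=True)

# A high cosine similarity between a Forbin print and a digitized press-page crop
# may indicate a republished photograph (hypothetical file names).
sim = (embed("SHDGR_2_K_247_012_0042_0.jpg") @ embed("press_page_crop.jpg").T).item()
print(f"cosine similarity: {sim:.3f}")
```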

7 Discussion

Existing historical and cultural datasets often exhibit strong task bias and limited interoperability, reflecting the specific goals, purposes, or modalities for which they were designed, and consequently constraining the scope of the research they enable. The Forbin dataset occupies a distinctive position at the intersection of archival research and multimodal machine learning. Its design seeks to overcome the fragmentation of prior resources by integrating diverse modalities (visual, textual, and contextual) within a unified archival framework.

The READ (Gruning et al., 2018), IMPACT (Papadopoulos et al., 2013) and FINLAM (Studer et al., 2023) datasets remain limited to the textual modality, excluding visual or photographic content that could reveal broader historical narratives. Other datasets, such as the ICFHR2018 (Strauß et al., 2018) competition corpus or COCO-Text (Veit et al., 2016), emphasize segmentation and transcription tasks but rely on modern or synthetic documents, detached from historical variability and material degradation. They therefore do not reflect the challenges of real-world archival documents, such as faded ink, bleed-through, or complex page structures.

Photographic datasets like the Bain Collection (Arnold & Tilton, 2023), Finnish WWII Photographs (Chumachenko et al., 2020), EyCon (Giardinetti et al., 2024) or Newspaper Navigator (Lee et al., 2020) offer valuable visual archives but suffer from limited metadata or small-scale coverage. These limitations restrict their potential for multimodal research that connects visual, textual, and contextual dimensions of historical sources.

By contrast, the Forbin dataset is designed to bridge these limits. It combines over 120,000 digitized photographic and textual elements through paired recto–verso images, enriched with metadata that preserve archival provenance and material context. Its scale places it among the largest collections of historical visual data, supporting both statistical exploration and model training while preserving curatorial depth and archival coherence. Compared to historical collections such as the Bain Collection, EyCon and the Finnish WWII Photographs, the Forbin dataset introduces several novel aspects:

  • Multimodality: It integrates both photographic and textual content (handwriting, printed materials, stamps, signatures) within the same pages and across recto–verso pairs;

  • Rich contextual metadata: Each item is accompanied by curated information on provenance, authorship, and editorial context, enabling semantic and temporal linking across documents;

  • Balanced annotation strategy: By combining detailed manual annotations with scalable semi-automatic processing, Forbin ensures both quality and volume, bridging the gap between small, highly curated corpora (e.g., READ) and massive, weakly annotated collections (e.g., Newspaper Navigator).

The Forbin dataset redefines what constitutes relevant information in the computational study of visual archives. By integrating multiple modalities and maintaining a strong link to their historical and archival context, it establishes a framework for the automated tracking of image reuse in periodicals such as the newspaper L’Illustration, and for the enrichment of metadata through external knowledge bases such as Wikidata and other related cultural repositories. In this sense, Forbin does not merely increase the amount of available data, it reframes the conceptual scope of visual archive analysis, promoting a more reflexive, interpretable, and interdisciplinary exploration of cultural heritage at the crossroads of archival research and multimodal machine learning.

8 Conclusion

The proposed Forbin dataset constitutes a significant contribution to digital humanities and computer vision fields. This dataset is designed to support large-scale research on historical photographic archives. The annotated subset provides a high-quality foundation for layout segmentation, text region classification, and semantic analysis of historical documents. The Forbin dataset is structured in a COCO-inspired JSON format by integrating visual, textual and archival metadata in a unified schema. This hybrid representation not only preserves essential physical information (dimensions and localisation) but also links each region to transcribed content and its archival reference. Such interoperability enables both computer vision and digital humanities workflows, bridging historical expertise with machine learning pipelines.

From a methodological standpoint, the Forbin dataset establishes a foundation for multimodal document understanding in historical archives. It provides a valuable benchmark for studying the specific challenges of cultural heritage imaging, including degradation analysis and robustness to historical noise. Moreover, the coexistence of photographic content, recto–verso structure and metadata opens new research avenues in multimodal document understanding, enabling models to jointly reason over visual and textual information.

In future work, the annotated corpus will be extended and released as a Hugging Face dataset, allowing for reproducible research and seamless integration into deep learning frameworks. Subsequent phases of the project will focus on automated metadata enrichment, leveraging object detection and inter-image similarity to propagate and harmonize metadata across collections. This will contribute to reconstructing the circulation and reuse of news photographs across archives and publications.

Ultimately, the Forbin dataset illustrates the potential of combining computer vision and archival science to transform historical image corpora into structured, searchable, and analyzable resources, opening new perspectives for digital humanities, heritage AI, and historical image retrieval.

Notes

[7] A subset is available for illustration at: https://mchelali.github.io/forbin_dataset/.

Acknowledgements

This work is supported by the French National Research Agency under the ANR-24-CE38-4079 project. We thank Sihem Yousfi and Marie-Louise Kitoko for their valuable collaboration on the annotation and the identification of photographic circulation.

Competing Interests

The authors have no competing interests to declare.

Author Contributions

Mohamed Chelali: Conceptualization; Data Curation; Investigation; Validation; Writing – Original Draft; Writing – Review & Editing; Project Administration.

Sylvain-Karl Gosselet: Data Curation; Investigation; Validation; Writing – Review & Editing.

Florence Cloppet: Visualization; Formal Analysis; Writing – Review & Editing.

Camille Kurtz: Visualization; Formal Analysis; Writing – Review & Editing.

Isabelle Bloch: Visualization; Formal Analysis; Writing – Review & Editing.

Daniel Foliard: Supervision; Project Administration; Funding Acquisition; Visualization; Writing – Review & Editing.

DOI: https://doi.org/10.5334/johd.487 | Journal eISSN: 2059-481X
Language: English
Submitted on: Nov 21, 2025
Accepted on: Feb 17, 2026
Published on: Apr 6, 2026
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2026 Mohamed Chelali, Sylvain-Karl Gosselet, Florence Cloppet, Camille Kurtz, Isabelle Bloch, Daniel Foliard, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.