The FORBIN Dataset: A Collection of Historical Photographs With Archival Metadata

Mohamed Chelali; Sylvain-Karl Gosselet; Florence Cloppet; Camille Kurtz; Isabelle Bloch; Daniel Foliard

doi:10.5334/johd.487

Journal of Open Humanities Data

Volume 12 (2026): Issue 1

Open Access

Apr 2026

Figures & Tables

Illustration of the visual heterogeneity across heritage datasets. The EyCon photograph (Giardinetti et al., 2024) shows overlapping prints, typical of archival contact sheets, while the Forbin image has a caption stitched to the lower margin of the recto and stamps on the verso. Variations in contrast and tone result from differences in material supports and digitization workflows.

Comparison of the scale of the proposed Forbin dataset with a range of existing benchmarks. Monomodal datasets are shown in blue and multimodal datasets in red. Each dataset is positioned according to its targeted tasks on the vertical axis and its total number of samples on the horizontal axis.

Table 1

Comparison of cultural and historical datasets according to covered tasks. While most existing collections focus on a single objective, such as handwritten text recognition, document layout analysis, or photo archiving, the Forbin dataset stands out by jointly addressing multiple complementary dimensions. It combines annotated historical photographs with metadata, textual content (both printed and handwritten), and scene text annotations, offering a unified resource for multimodal analysis of historical visual materials.

DATASET	SIZE	HANDWRITTEN TEXT RECOGNITION	METADATA	HISTORICAL PHOTOS	OCR & SCENE TEXT	LAYOUT & DOCUMENT ANALYSIS
READ	400K	✓	—	—	—	—
IMPACT	20K	—	—	—	✓	—
ICFHR2018	3K	✓	—	—	—	—
FINLAM	161K	—	—	—	—	✓
Newspaper Navigator	16M	—	—	—	—	✓
Bain Collection	40K	—	—	✓	—	—
Finnish WWII	160K	—	✓	✓	—	—
EyCon	130K	—	✓	✓	—	—
Forbin dataset	120K	✓	✓	✓	✓	—
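
The coverage matrix in Table 1 can also be read programmatically. The sketch below transcribes the table rows into a small Python structure and lists the tasks each dataset covers. The variable and function names are illustrative assumptions and are not part of any tooling released with the dataset.

```python
# Task columns of Table 1, in the order they appear in the table.
TASKS = (
    "handwritten_text_recognition",
    "metadata",
    "historical_photos",
    "ocr_scene_text",
    "layout_document_analysis",
)

# Rows transcribed from Table 1: (dataset, size, one flag per task column).
TABLE_1 = [
    ("READ", "400K", (1, 0, 0, 0, 0)),
    ("IMPACT", "20K", (0, 0, 0, 1, 0)),
    ("ICFHR2018", "3K", (1, 0, 0, 0, 0)),
    ("FINLAM", "161K", (0, 0, 0, 0, 1)),
    ("Newspaper Navigator", "16M", (0, 0, 0, 0, 1)),
    ("Bain Collection", "40K", (0, 0, 1, 0, 0)),
    ("Finnish WWII", "160K", (0, 1, 1, 0, 0)),
    ("EyCon", "130K", (0, 1, 1, 0, 0)),
    ("Forbin dataset", "120K", (1, 1, 1, 1, 0)),
]


def covered_tasks(name):
    """Return the task names checked in Table 1 for the given dataset."""
    for dataset, _size, flags in TABLE_1:
        if dataset == name:
            return [task for task, flag in zip(TASKS, flags) if flag]
    raise KeyError(name)


def datasets_covering_at_least(n):
    """Return the datasets that cover at least n of the five tasks."""
    return [dataset for dataset, _size, flags in TABLE_1 if sum(flags) >= n]
```

Calling `datasets_covering_at_least(3)` on this transcription returns only the Forbin dataset, which is the point the caption makes about joint task coverage.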

Overview of the construction workflow of the Forbin dataset and selected samples. **(a)** Photographs were extracted from archival boxes, scanned, and digitized. Metadata related to provenance and digitization conditions were collected during this process. **(b)** Examples of digitized samples illustrating the visual diversity and quality of the data. The recto may show an original print or a retouched photograph prepared for layout and publication, while the verso often carries handwritten notes, captions in multiple languages, stamps, and editorial marks.

Three pairs of images from the proposed Forbin dataset, showing both the recto and verso sides (first and second rows), together with their associated metadata (bottom row). The metadata include the image identifier, archival box name, and country of origin. The continent field results from a manual classification into continental groups, or into thematic groups when the geographical origin is unknown. The cluster and cluster name are derived from HDBSCAN clustering applied to the class attributes across the entire dataset.
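
The metadata fields named in this caption could be modeled as a simple record type. The following is a minimal sketch only: the field names, the use of `None` for an unknown country, and the sample values are assumptions made for illustration, since this page does not give the schema of the released files.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PhotoRecord:
    """Metadata for one recto/verso pair (hypothetical field names)."""

    image_id: str
    box_name: str             # archival box the photograph came from
    country: Optional[str]    # None when the geographical origin is unknown
    continent: str            # manually assigned continental or thematic group
    cluster: int              # HDBSCAN label; -1 conventionally marks noise points
    cluster_name: str


def has_known_origin(record: PhotoRecord) -> bool:
    """True when a country of origin is recorded for the photograph."""
    return record.country is not None


# Invented values for illustration; they do not come from the dataset.
example = PhotoRecord(
    image_id="img_0001",
    box_name="box_A",
    country=None,
    continent="thematic_group",
    cluster=3,
    cluster_name="example cluster",
)
```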

Distribution of the 30 semantic classes across continents. The pie charts reveal a marked thematic imbalance, with “Military life” dominating in most regions, though not in Africa. This visualization underscores regional variations in subject matter within the Forbin dataset. The complete list of classes and their corresponding color codes is provided below.
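
A per-continent class distribution of this kind can be computed from (continent, class) pairs. The sketch below assumes such pairs are available. The function name and the toy records are illustrative and do not reflect the actual counts in the dataset. "Military life" is the only class name taken from the caption, and "Other" is a placeholder.

```python
from collections import Counter, defaultdict


def class_shares_by_continent(records):
    """Map each continent to {class_name: fraction of that continent's images}."""
    counts = defaultdict(Counter)
    for continent, class_name in records:
        counts[continent][class_name] += 1

    shares = {}
    for continent, counter in counts.items():
        total = sum(counter.values())
        shares[continent] = {name: n / total for name, n in counter.items()}
    return shares


# Toy records, invented for illustration only.
toy_records = [
    ("Europe", "Military life"),
    ("Europe", "Military life"),
    ("Europe", "Other"),
    ("Africa", "Other"),
]
shares = class_shares_by_continent(toy_records)
```

Each inner dictionary sums to 1, so it corresponds to the slices of one pie chart.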

Number of instances per annotation category in the annotated images. Bars that share the same color belong to the same super-category.

Annotated photographs from the Forbin dataset. Each image displays polygons corresponding to different annotation categories, with transcribed text linked to each region. The examples highlight the diversity of text layouts, orientations, and typographies captured in the dataset.
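
A polygon annotation with a linked transcription could be represented as below. This is a sketch under stated assumptions: the class name, field names, the "stamp" category label, and the coordinates are hypothetical, and the actual annotation format is not specified on this page. The area helper uses the standard shoelace formula.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]


@dataclass
class TextRegion:
    """One annotated region: a category, a polygon, and its transcribed text."""

    category: str
    polygon: List[Point]   # vertices in order, in pixel coordinates
    transcription: str


def polygon_area(polygon: List[Point]) -> float:
    """Area of a simple polygon via the shoelace formula."""
    twice_area = 0.0
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0


def bounding_box(polygon: List[Point]) -> Tuple[float, float, float, float]:
    """Axis-aligned box (x_min, y_min, x_max, y_max) enclosing the polygon."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)


# Invented region for illustration: a 100 x 40 pixel rectangle.
region = TextRegion(
    category="stamp",
    polygon=[(10, 20), (110, 20), (110, 60), (10, 60)],
    transcription="EXAMPLE",
)
```

Storing polygons instead of boxes keeps rotated or curved text regions tight, and a bounding box can still be derived when a detector expects one.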

DOI: https://doi.org/10.5334/johd.487 | Journal eISSN: 2059-481X

Language: English

Submitted on: Nov 21, 2025

Accepted on: Feb 17, 2026

Published on: Apr 6, 2026

Published by: Ubiquity Press

In partnership with: Paradigm Publishing Services

Publication frequency: 1 issue per year

Keywords:

Historical photographs,

© 2026 Mohamed Chelali, Sylvain-Karl Gosselet, Florence Cloppet, Camille Kurtz, Isabelle Bloch, Daniel Foliard, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.
