The FORBIN Dataset: A Collection of Historical Photographs With Archival Metadata

Mohamed Chelali; Sylvain-Karl Gosselet; Florence Cloppet; Camille Kurtz; Isabelle Bloch; Daniel Foliard

doi:10.5334/johd.487

Journal of Open Humanities Data

Volume 12 (2026): Issue 1

Open Access

Apr 2026

Abstract

Early photography and the first massification of the medium at the turn of the 20th century are foundational to present-day visual cultures. This is particularly true of the visual archives of the first generation of news picture agencies, from the 1900s to the 1920s. The latest generation of artificial intelligence models makes it possible to analyse such photographs beyond their purely pictorial dimension, either by grouping them according to their semantic content or by generating textual descriptions. Cross-seeding approaches between historical epistemologies and computer vision expertise can unlock new perspectives on the computational analysis of large and poorly curated photographic archives and provide new insights into early photographic cultures. The automated detection of visual and textual contents, as well as the exploration of the transformation of continuous-tone photographs into halftone images, can help us understand the complexity and historical variability of the systems and structures that created, stored and circulated news photographs. In this paper, we introduce the FORBIN dataset. Victor Forbin, a Paris-based journalist, created his own news picture agency: he bought original prints from around the world and sold them to major French newspapers from the late 1900s to the 1930s. The FORBIN dataset is composed of 62,135 photographs, each digitized on both its front and back sides. This paper describes the history of the collection, its digitization and its subsequent transformation into a dataset. The FORBIN dataset can be used for various computer vision tasks, such as text recognition, image similarity analysis, caption generation, and metadata extraction.


DOI: https://doi.org/10.5334/johd.487 | Journal eISSN: 2059-481X

Language: English

Submitted on: Nov 21, 2025

Accepted on: Feb 17, 2026

Published on: Apr 6, 2026

Published by: Ubiquity Press

In partnership with: Paradigm Publishing Services

Publication frequency: 1 issue per year

Keywords: Historical photographs

© 2026 Mohamed Chelali, Sylvain-Karl Gosselet, Florence Cloppet, Camille Kurtz, Isabelle Bloch, Daniel Foliard, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.
