Flora Batava (1800–1934): From Historical Citizen Science to Plant Humanities Dataset

Open Access
|Jan 2026

(1) Overview

Repository location

https://doi.org/10.5281/zenodo.17713394

Context

Recent efforts in the digitization of biodiversity data have enabled a wave of innovative research inquiries that extend far beyond the purpose for which these data were initially collected. An increasing number of humanities scholars are also using digitized biodiversity data to analyse topics such as social biases in specimen collection (e.g., Groom et al., 2014), the impact of colonial legacies (e.g., Morales Furuta et al., 2025), and changes in the documentation of botanical knowledge (e.g., Smail et al., 2021). However, because of their primary focus on the biological sciences, biodiversity-related datasets often lack additional information that might be valuable for scholars in other fields. Here we present a newly compiled dataset based on the first illustrated flora of the Netherlands, the Flora Batava series published between 1800 and 1934. To facilitate research integration, this set of floristic data was compiled from a humanities perspective, especially considering ethnological and archival research (Ashby, 2024; Jacobs & Griebeler, 2025), thus providing an opportunity to analyse historical flora from a wider perspective.

Compilation of this dataset is part of the project Botanical Records through a Social Lens, which explores biases and perceptions in historical citizen science. Unlike other floras of the period, the Flora Batava relied heavily on observations made by people other than the series editors. In addition to their observations, the public was encouraged to contribute information such as local common names and uses for each species. Our analysis shows that over 400 observers provided data on species occurrence during the 134-year publication period. Despite difficulties in uncovering detailed biographic information about the observers, we managed to disambiguate the names and identify the sex of over 85% of all people involved in the Flora Batava. The dataset also lists the date and place of publication, as well as the name of the responsible editor, for each of the 28 volumes that comprise the series. Regarding the plants, fungi, and algae described in the original material, the dataset includes both the currently accepted botanical nomenclature and the common and Latin names under which the species were mentioned in the original publication.

(2) Method

Steps

Scanning: All pages in each of the volumes comprising the Flora Batava were scanned in 2021–2022 by Groenedijk Microfilm- en Scanservice (GMS) by order of the KB. Scanning was done according to the guidelines of Metamorfoze [1], the national Dutch program for the preservation of paper heritage.

Botanical identification and nomenclature: Digitized files were distributed to a group of experts in plant and fungal taxonomy who were consulted to check species identifications and update the nomenclature where needed (see Dataset Creators). For detailed information on the process, see van Gelder and Peeters (2023).

Handwritten text recognition (HTR): Although the source material is not handwritten, differences in typesetting among volumes made HTR the preferred mode of digitization. HTR was performed with Loghi, an end-to-end framework for historical document processing (van Koert et al., 2024). Using PageXML ground truth from sample pages, we trained Loghi’s layout-analysis and text-recognition models on the specific fonts and printing artefacts of the corpus. We then applied the trained models to the full set of scans to produce machine-readable PageXML.
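
As an illustration of how the resulting PageXML can be turned into plain text for the later steps, the following is a minimal Python sketch and not the project's own code. The XML fragment is a made-up, heavily trimmed example, and the function name page_text is hypothetical. The namespace wildcard avoids assuming a particular PageXML schema version.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed PageXML fragment. Real files also carry
# coordinates, baselines, and metadata; the text content here is invented.
SAMPLE_PAGEXML = """
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="example.jpg">
    <TextRegion id="r1">
      <TextLine id="r1l1"><TextEquiv><Unicode>Eerste regel</Unicode></TextEquiv></TextLine>
      <TextLine id="r1l2"><TextEquiv><Unicode>Tweede regel</Unicode></TextEquiv></TextLine>
    </TextRegion>
  </Page>
</PcGts>
""".strip()


def page_text(xml_string):
    """Return the recognised text of one page, one TextLine per output line."""
    root = ET.fromstring(xml_string)
    lines = []
    # "{*}" matches any namespace (Python 3.8+), so the schema version
    # declared in the file does not need to be hard-coded.
    for line in root.iterfind(".//{*}TextLine"):
        unicode_el = line.find("{*}TextEquiv/{*}Unicode")
        if unicode_el is not None and unicode_el.text:
            lines.append(unicode_el.text.strip())
    return "\n".join(lines)


print(page_text(SAMPLE_PAGEXML))
```

The sketch reads lines in document order and ignores region-level reading order, which a real pipeline would probably need to respect on multi-column pages.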

Text segmentation: Text from each page containing a species’ description in Dutch was segmented using a generative AI model (OpenAI’s GPT-4) to automatically identify and label different sections. The model was prompted as an “expert system” in historical botanical texts and instructed to assign labels from a predefined list: “species names” (Figure 1a), “flowering time” (Figure 1b), “classification” (Figure 1c), “sexual characteristics” (Figure 1d), “species traits” (Figure 1e), “habitat” (Figure 1f), “medicinal use” (Figure 1g), and “domestic use” (Figure 1h). The model received the plain text as input and returned JSON files with each section paired with its assigned label (Supplementary File 1). The benefit of this procedure was consistent segmentation despite variation in historical terminology and formatting among volumes.
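
Because the labels come from a closed list, the returned JSON can be checked mechanically before further processing. The sketch below shows one way to do this in Python. The JSON shape (a list of objects with "label" and "text" keys), the function name parse_segments, and the sample text are assumptions for illustration; the actual output format is defined by the prompt in Supplementary File 1.

```python
import json

# The eight labels named in the segmentation step.
ALLOWED_LABELS = {
    "species names", "flowering time", "classification",
    "sexual characteristics", "species traits", "habitat",
    "medicinal use", "domestic use",
}


def parse_segments(raw_json):
    """Group segment texts by label and reject labels outside the list.

    Assumes (hypothetically) that the model returns a JSON list of
    {"label": ..., "text": ...} objects.
    """
    sections = {}
    for item in json.loads(raw_json):
        label = item["label"].strip().lower()
        if label not in ALLOWED_LABELS:
            raise ValueError(f"unexpected label: {label!r}")
        sections.setdefault(label, []).append(item["text"])
    return sections


# Invented example response, not taken from the dataset.
example = json.dumps([
    {"label": "Habitat", "text": "In weilanden bij Leiden."},
    {"label": "flowering time", "text": "Bloeit in Mei en Junij."},
])
print(parse_segments(example))
```

Rejecting unknown labels outright, rather than silently dropping them, makes it easier to spot pages where the model departed from the predefined list.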

Figure 1

Example double page of the Flora Batava series including illustration and text sections. Letters indicate the different segments of the text: a) “species names”, b) “flowering time”, c) “classification”, d) “sexual characteristics”, e) “species traits”, f) “habitat”, g) “medicinal use”, h) “domestic use”.

Data extraction: Generative AI was also used to extract observational data from each segmented habitat section. Here, the model was prompted as an NLP system specialized in historical Dutch and instructed to identify plant observations together with their associated elements: (i) geolocatable place names, (ii) habitat descriptions, (iii) observers, and (iv) dates exactly as phrased in the source. Detailed rules were formulated to allow the model to focus on Dutch localities, distinguish between true locations and generic landscape terms, split compound location references, and finally group all information belonging to a single observation (Supplementary File 2). Results were again returned as JSON.
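
To show how grouped observations of this kind might be turned into tabular rows, here is a short Python sketch. The JSON layout (an "observations" list whose items hold "places", "habitat", "observer", and "date"), the function name flatten_observations, and the sample record are all assumptions made for illustration; the real structure follows the rules in Supplementary File 2.

```python
import json


def flatten_observations(raw_json):
    """Return one row per place name, keeping the shared observation fields.

    Assumes a hypothetical layout: {"observations": [{"places": [...],
    "habitat": ..., "observer": ..., "date": ...}, ...]}.
    """
    rows = []
    data = json.loads(raw_json)
    for obs_id, obs in enumerate(data["observations"], start=1):
        # A compound location reference yields several places; an observation
        # without any place still produces a single row with place=None.
        for place in obs.get("places") or [None]:
            rows.append({
                "observation_id": obs_id,
                "place": place,
                "habitat": obs.get("habitat"),
                "observer": obs.get("observer"),
                "date": obs.get("date"),
            })
    return rows


# Invented example, not taken from the dataset.
example = json.dumps({"observations": [
    {"places": ["Leiden", "Haarlem"], "habitat": "aan slootkanten",
     "observer": "A. de Vries", "date": "Junij 1820"},
    {"places": [], "habitat": "in bosschen", "observer": None, "date": None},
]})
for row in flatten_observations(example):
    print(row)
```

Keeping a shared observation_id on the split rows preserves the grouping of information that belongs to a single observation.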

Geocoding: Nominatim, a tool to search OpenStreetMap by name and address, was combined with the R package tidygeocoder to assign geographical coordinates to the localities described in the source material. Toponyms that could not be automatically geocoded were edited manually or with the aid of a generative AI model (GPT-4).
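
The two-stage logic of this step (automatic lookup first, manual handling of the remainder) can be sketched as follows. This is a Python stand-in written for illustration, not the R and tidygeocoder workflow the authors used. The function names, the restriction to the Netherlands via countrycodes, and the stub lookup with placeholder coordinates are assumptions; no network request is made here.

```python
from urllib.parse import urlencode

NOMINATIM_SEARCH = "https://nominatim.openstreetmap.org/search"


def build_query_url(toponym):
    """Build a Nominatim search URL limited to the Netherlands (an assumption)."""
    params = {"q": toponym, "format": "jsonv2", "countrycodes": "nl", "limit": 1}
    return f"{NOMINATIM_SEARCH}?{urlencode(params)}"


def geocode_all(toponyms, lookup):
    """Split toponyms into geocoded ones and ones left for manual editing.

    `lookup` is any callable returning (lat, lon) or None, so a real
    Nominatim client or a test stub can be plugged in.
    """
    resolved, unresolved = {}, []
    for name in toponyms:
        coords = lookup(name)
        if coords is None:
            unresolved.append(name)
        else:
            resolved[name] = coords
    return resolved, unresolved


# Stub lookup with placeholder coordinates, used instead of a live service.
def stub_lookup(name):
    return {"Leiden": (52.16, 4.49)}.get(name)


resolved, unresolved = geocode_all(["Leiden", "bij de oude molen"], stub_lookup)
print(build_query_url("Den Haag"))
print(resolved, unresolved)
```

Passing the lookup as a parameter keeps the fallback logic testable offline and leaves rate limiting and usage-policy compliance to whichever client is supplied.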

Data enrichment: Information about the botanical family and general division to which each species belongs was gathered from the Taxonomic Name Resolution Service (Boyle et al., 2021) and Mycobank (Robert et al., 2005). The names of the observers involved in the Flora Batava were disambiguated, and supplemented with background information, by mapping them to a reference list of Dutch botanists compiled by Floristic Research Netherlands (FLORON). This resulted in a unique identifier for each person, similar to the procedure followed by Sparrius et al. (2019). Finally, data on the publication year, editor, and publisher were compiled from the cover page of each volume in the Flora Batava series.
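
The source does not specify how the mapping to the reference list was built, so the following Python sketch only illustrates one plausible approach: normalise spelling, try an exact match, then fall back to a fuzzy match. The names, identifiers, similarity cutoff, and function names are invented for the example and do not come from the FLORON list.

```python
import difflib
import unicodedata


def normalise(name):
    """Lower-case a name, strip diacritics and punctuation, collapse spaces."""
    decomposed = unicodedata.normalize("NFKD", name)
    plain = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(plain.lower().replace(".", " ").replace(",", " ").split())


def match_observer(name, reference, cutoff=0.85):
    """Return the identifier of the best-matching reference person, or None.

    `reference` maps reference names to identifiers. The cutoff is an
    arbitrary illustrative value, not one reported by the authors.
    """
    index = {normalise(ref): person_id for ref, person_id in reference.items()}
    key = normalise(name)
    if key in index:
        return index[key]
    close = difflib.get_close_matches(key, list(index), n=1, cutoff=cutoff)
    return index[close[0]] if close else None


# Hypothetical reference list with made-up people and identifiers.
reference = {"A. de Vries": "P001", "B. Jansen": "P002"}
print(match_observer("A. De Vries", reference))
print(match_observer("B. Janssen", reference))
print(match_observer("X. Unknown", reference))
```

In practice, fuzzy matches would still need manual confirmation, since historical spelling variants and shared surnames can produce false positives.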

Quality control

During scanning, quality checks of both images and metadata were carried out by GMS and KB teams. Botanical identifications were checked against earlier datasets, including that of Duistermaat et al. (2021). When depicted plants or fungi could not be attributed to a specific taxon at (infra)specific level, a higher taxonomic rank was used. Text segmentation quality was assessed by manually reviewing a random sample of entries (N = 100). After data extraction, all entries were individually checked against the source material and, where needed, manually corrected for spelling and accuracy. Entries were again manually checked during geocoding and data enrichment.
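
A reproducible way to draw such a review sample is sketched below in Python. The function name, the fixed seed, and the placeholder entry identifiers are assumptions; the source does not describe how the 100 entries were selected beyond their being random.

```python
import random


def draw_review_sample(entry_ids, n=100, seed=0):
    """Draw n distinct entries for manual review, reproducibly.

    A dedicated Random instance with a fixed seed means the same sample
    can be regenerated later without touching global random state.
    """
    rng = random.Random(seed)
    return sorted(rng.sample(list(entry_ids), n))


# Placeholder identifiers standing in for segmented entries.
all_entries = range(1, 2001)
sample = draw_review_sample(all_entries)
print(len(sample), sample[:5])
```

Recording the seed alongside the review notes allows a second reviewer to inspect exactly the same entries.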

(3) Dataset Description

Repository name

Zenodo

Object name

Flora-Batava-people-plants-location

Format names and versions

CSV

Creation dates

Creation of this dataset began with the digitization of the source material on 2021-08-16 and was concluded with data enrichment on 2025-10-30.

Dataset creators

In addition to the authors of this manuscript, Rutger van Koert (KNAW Huygens Institute) contributed to the HTR step. Botanical identification and nomenclature check was carried out in the context of the Flora Batava facsimile edition published before the preparation of this dataset (van Gelder and Peeters, 2023). The effort was coordinated by Leni Duistermaat (Naturalis Biodiversity Centre), Gerda van Uffelen (Universiteit Leiden), Anastasia Stefanaki (Utrecht University), Eddy Weeda, and Anneke van der Putte (Nederlandse Mycologische Vereniging). A full list of contributors is cited in van Gelder and Peeters (2023).

Language

Variable names, as well as most variable states, are presented in English. Five variables – original common name, location, location remarks, habitat, and observation notes – are in Dutch, as they contain verbatim text from the Flora Batava. Current species names follow the International Code of Nomenclature for Algae, Fungi, and Plants (Turland et al., 2025), while 19th-century names are listed verbatim.

License

The data has been deposited under an open CC 4.0 license.

Publication date

2025-11-25.

(4) Reuse Potential

The Flora Batava dataset offers a wide range of reuse opportunities beyond the more obvious application in ecology research. In particular, this dataset is well suited for analysing the formation of botanist networks and the participation of women. As it already aggregates information about the plant observers, researchers reusing this dataset can avoid the time-consuming step of disambiguating people’s names (Groom et al., 2014, 2022). The inclusion of old Latin and common names allows tracking name changes and variations to uncover links between plants and local cultures, as well as the history of botanical taxonomy. Finally, similar to digitized herbarium data, this dataset can serve as material for science outreach activities such as hackathons (Meeus et al., 2021), and artistic practices such as storytelling (Jacobs & Griebeler, 2025), which can be combined with digital visualization methods to imagine the work and experiences of plant observers over the past two centuries.

Data Accessibility statement

The dataset described here is fully available through an open repository. The prompts used to segment and extract the data are also available as supplementary files.

Additional Files

The additional files for this article can be found as follows:

Supplementary File 1

Final version of the prompt for data segmentation using a generative AI model (GPT-4). DOI: https://doi.org/10.5334/johd.497.s1

Supplementary File 2

Final version of the prompt for data extraction using a generative AI model (GPT-4). DOI: https://doi.org/10.5334/johd.497.s2

Notes

[1] Retrieved December 08, 2025 from www.metamorfoze.nl.

Acknowledgements

The authors acknowledge the contribution of the many experts who provided species identification, as well as Rutger van Koert (KNAW Huygens Institute) for his work on the HTR. We are also thankful to Rebeca Ibáñez Martín (KNAW Meertens Institute), Koen Verhoeven (Netherlands Institute of Ecology), Geert Buelens (Utrecht University), and Els Stronks (Utrecht University) for their consultative support in the project Botanical Records Through a Social Lens.

Competing Interests

The authors have no competing interests to declare.

Author Contributions

Luiza Teixeira-Costa: Conceptualization, Data curation, Investigation, Software, Validation, Writing – original draft, Writing – review & editing.

Esther van Gelder: Conceptualization, Funding acquisition, Resources, Writing – review & editing.

Laurens Sparrius: Data curation, Funding acquisition, Resources, Writing – review & editing.

Folgert Karsdorp: Conceptualization, Data curation, Funding acquisition, Methodology, Project administration, Software, Validation, Writing – review & editing.

DOI: https://doi.org/10.5334/johd.497 | Journal eISSN: 2059-481X
Language: English
Submitted on: Dec 8, 2025 | Accepted on: Dec 18, 2025 | Published on: Jan 9, 2026
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2026 Luiza Teixeira-Costa, Esther van Gelder, Laurens Sparrius, Folgert Karsdorp, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.