1 Overview
Repository location
The data are shared openly on Zenodo (Ma & Li, 2025) under the DOI https://doi.org/10.5281/zenodo.15207202.
Context
This dataset was created as part of a program funded by Indiana University Bloomington’s Institute for Advanced Study (IAS), titled “Empirical Exploration of the Visual Language in Digital Humanities.” Building upon our earlier work on visual inscription use in digital humanities (DH) scholarship (Ma & Li, 2022), the overarching purpose of this project is to empirically analyze how visual representations function in conjunction with textual narratives in DH scholarship to produce knowledge and create a dynamic scholarly communication process.
Following our previous research (Ma & Li, 2022), we define visual representations as any graphs or non-graph illustrations (NGIs). Graphs encompass a broad range of data visualizations, while NGIs include illustrations such as collages, diagrams, and photographs that appear as figures in research articles. We did not include tables or textual illustrations, which, based on our previous findings, are also commonly used in DH publications (Ma & Li, 2022).
To understand the larger context of how visual representations are used in DH research, we collected three categories of information in this dataset: (1) journal publication metadata, (2) information about the visual representations, and (3) accompanying textual contexts and descriptions of the visual representations. This dataset aims to build a solid basis for further exploration of visual languages in DH work.
2 Method
The creation of this dataset involved the following two major steps:
Step 1: Identify the corpus
To create this dataset, we gathered a list of journals considered “exclusively digital humanities” (Spinaci et al., 2022). The complete journal list is available on our GitHub project page (https://github.com/nalsi/dh_visualization). We consulted librarians, journal websites, and relevant scholarly databases to identify four major DH journals from the list, for which we have full-text access and/or permission to extract visual-related content:
Literary and Linguistic Computing (LLC)
Digital Scholarship in the Humanities (DSH)
Journal on Computing and Cultural Heritage (JOCCH)
Digital Humanities Quarterly (DHQ)
It is worth noting that DSH is the result of the rebranding of LLC (Sula & Hill, 2019). In our dataset, they are treated and encoded as two separate journals. However, future researchers are encouraged to link them in order to analyze how this single journal has evolved over time. Three of the four journals (DSH, LLC, and DHQ) are flagship journals of the Alliance of Digital Humanities Organizations (ADHO), the largest international organization supporting DH research. As a result, these journals are among the best-funded, most established, most widely read, and most canonically recognized in the field. JOCCH, although published by the ACM and affiliated with a different institutional network, also benefits from substantial support and a broad international readership, particularly among scholars working at the intersection of digital cultural heritage and DH. Given this context, we believe that this curated subsample of the “exclusively DH journals” offers a strong representation of international DH scholarship at a macro level. That said, we acknowledge a potential bias of our dataset: the current version does not include regional or discipline-specific journals that also make significant contributions to DH research. We discuss this limitation and our plans to address it in more detail in the Reuse Potential section.
We collected all available publications from these four journals to generate the largest possible sample. Our final sample contains 3,274 unique publications from these journals. A more detailed description of the final sample is available on our GitHub project page: https://github.com/nalsi/dh_visualization.
Step 2: Extract visual-related information from the full text
We used the full-text publications to identify and extract visualization-related information. The publications in our corpus are available in XML format for DHQ, and in HTML format for LLC, DSH, and JOCCH. The extraction focused on three key components for each figure: figure metadata (e.g., IDs and image links), figure captions, and in-text references or descriptions. To extract this information, we manually examined the article structure of each journal to identify the relevant tags and tailored the extraction pipeline accordingly. Although minor adjustments were required for each journal, the overall pipeline consists of several modular components designed to accommodate diverse formats and address various extraction challenges, as detailed below. While the Python code is not included in the current version of the dataset, it will be made available in the next update, scheduled for release in Summer 2025.
XML Data Extraction: The Data Extraction From XML.py module uses Python’s lxml library to parse XML documents. It employs XPath queries to extract key elements such as the title, author information, publication date, volume, issue, and keywords. Figure references are extracted through both direct XML tag navigation and regular expression matching, ensuring that indirect references (e.g., “Figures 2–4”) are accurately captured. Extracted data is saved to CSV files, ensuring incremental progress and data integrity.
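To make this concrete, below is a minimal sketch of this kind of lxml-based extraction. The tag names, file names, and regular expression are illustrative assumptions for a generic article XML, not the actual markup of the journals or the contents of Data Extraction From XML.py.

import csv
import re
from lxml import etree

tree = etree.parse("article.xml")  # hypothetical input file
root = tree.getroot()

def first_text(xpath_expr):
    """Return the text of the first node matched by an XPath query, or ''."""
    nodes = root.xpath(xpath_expr)
    return nodes[0].text.strip() if nodes and nodes[0].text else ""

record = {
    "title": first_text("//title"),  # assumed tag names; real markup may differ
    "date": first_text("//date"),
    "authors": "; ".join(a.text for a in root.xpath("//author") if a.text),
}

# Catch indirect references such as "Figures 2-4" in the running text
# and expand them to individual figure numbers.
ref_pattern = re.compile(r"Figures?\s+(\d+)(?:\s*[-–]\s*(\d+))?")
figure_refs = set()
for para in root.xpath("//p"):
    text = "".join(para.itertext())
    for match in ref_pattern.finditer(text):
        start = int(match.group(1))
        end = int(match.group(2)) if match.group(2) else start
        figure_refs.update(range(start, end + 1))

# Append one row per article so progress is saved incrementally.
with open("metadata.csv", "a", newline="", encoding="utf-8") as f:
    csv.writer(f).writerow([record["title"], record["date"], record["authors"],
                            ";".join(map(str, sorted(figure_refs)))])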
Positional Reference Extraction: The Extracting ref from position.py module targets the extraction of in-text descriptions of the graphs within the full texts (i.e., the sentences in which a graph is mentioned). By analyzing adjacent <p> tags using XPath selectors, the module maps figures to their descriptive paragraphs, thereby enriching the visual metadata.
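As an illustration, the sketch below maps each figure to its neighboring paragraphs, assuming figures and <p> elements are siblings in the document tree; the selectors are assumptions, not the exact XPath expressions used in Extracting ref from position.py.

from lxml import etree

tree = etree.parse("article.xml")  # hypothetical input file
figure_to_sentences = {}

for figure in tree.xpath("//figure"):
    fig_id = figure.get("id", "")
    # Inspect the paragraphs immediately before and after the figure node.
    neighbors = figure.xpath("preceding-sibling::p[1] | following-sibling::p[1]")
    sentences = []
    for p in neighbors:
        text = " ".join("".join(p.itertext()).split())
        # Keep only the sentences that explicitly mention a figure.
        sentences.extend(s.strip() + "." for s in text.split(".")
                         if "figure" in s.lower())
    figure_to_sentences[fig_id] = sentences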
HTML Data Extraction and Web Scraping: HTML pages are processed via two complementary approaches. The DataExtraction.py module uses BeautifulSoup to parse static HTML content and extract metadata from embedded JSON-LD scripts. Meanwhile, the FetchHtml.py module leverages Selenium WebDriver to handle dynamic content and authentication, downloading HTML content for subsequent parsing.
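The two roles might look like the following sketch; the URL, driver setup, and JSON-LD fields are illustrative assumptions rather than the actual FetchHtml.py and DataExtraction.py code.

import json
from bs4 import BeautifulSoup
from selenium import webdriver

# Dynamic fetch (the FetchHtml.py role): download the rendered page source.
driver = webdriver.Chrome()  # assumes a local Chrome/ChromeDriver setup
driver.get("https://example.org/article")  # hypothetical article URL
html = driver.page_source
driver.quit()

# Static parse (the DataExtraction.py role): read embedded JSON-LD metadata.
soup = BeautifulSoup(html, "html.parser")
metadata = {}
for script in soup.find_all("script", type="application/ld+json"):
    try:
        data = json.loads(script.string or "")
    except json.JSONDecodeError:
        continue  # skip malformed JSON-LD blocks rather than failing the page
    if isinstance(data, dict):
        metadata.setdefault("title", data.get("headline"))
        metadata.setdefault("doi", data.get("identifier"))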
Image Metadata Extraction from HTML: Two scripts, ImageExtraction.py and ImageExtraction copy.py, extract image URLs, captions, and reference paragraphs from HTML elements. These scripts also incorporate mapping mechanisms to associate images with their corresponding article identifiers.
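A minimal sketch of this step follows, assuming conventional <figure>/<img>/<figcaption> markup and a hypothetical file-to-ID mapping; the actual journal pages may use different tags and class names.

from bs4 import BeautifulSoup

with open("article.html", encoding="utf-8") as f:  # hypothetical saved page
    soup = BeautifulSoup(f, "html.parser")

article_id = "LLC-1"  # hypothetical mapping from source file to publication ID
rows = []
for i, figure in enumerate(soup.find_all("figure"), start=1):
    img = figure.find("img")
    caption = figure.find("figcaption")
    rows.append({
        "id": article_id,
        "image id": f"{article_id}-fig{i}",  # illustrative ID scheme
        "url": img["src"] if img and img.has_attr("src") else "",
        "caption": caption.get_text(" ", strip=True) if caption else "",
    })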
Data Cleaning and Integration: Finally, the Text Data Cleaning.py module sanitizes the extracted text by removing extra whitespace, newline characters, and other formatting artifacts. This standardization ensures that the data is clean and consistent for further analysis. We further used R (R Core Team, 2024) to systematically integrate and standardize data collected from different journals, by cleaning author, institution, and DOI formats.
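The sanitization step might look like the sketch below; the artifact list is an assumption, not the exact rules implemented in Text Data Cleaning.py.

import re

def clean_text(raw: str) -> str:
    """Collapse newlines, tabs, and repeated spaces into single spaces."""
    text = raw.replace("\u00a0", " ")       # non-breaking spaces
    text = re.sub(r"[\r\n\t]+", " ", text)  # newlines and tabs
    text = re.sub(r" {2,}", " ", text)      # runs of spaces
    return text.strip()

print(clean_text("Figure 1\n shows   the\tnetwork."))  # "Figure 1 shows the network."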
The modular architecture of our extraction pipeline was developed with robustness, scalability, and reproducibility in mind. These key considerations are explained below, and a minimal sketch of the shared error-handling and configuration pattern follows the list:
Robustness: Each module implements error handling (e.g., try/except blocks) to ensure that failures in one part of the pipeline do not disrupt the entire process.
Scalability: By compartmentalizing tasks (XML, HTML, and web automation), the system can be easily extended to handle additional data sources or adapted to emerging extraction challenges.
Reproducibility: Configuration parameters such as file paths, API tokens, and URLs are managed through a dedicated configuration file, ensuring the extraction process can be reliably reproduced across different environments.
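For illustration, the error-handling and configuration pattern described above might be sketched as follows; the config file name and keys are assumptions, not the project's actual configuration.

import configparser

config = configparser.ConfigParser()
config.read("pipeline.ini")  # hypothetical config file holding paths and tokens
output_dir = config.get("paths", "output_dir", fallback="./output")

def run_module(name, func, *args):
    """Run one pipeline module; log failures without halting the whole run."""
    try:
        return func(*args)
    except Exception as exc:  # broad on purpose: one bad article should not stop the pipeline
        print(f"[{name}] failed: {exc}")
        return None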
3 Dataset Description
Repository name
The dataset is openly shared in Zenodo (Ma & Li, 2025).
Object name
Our final dataset contains three separate but interconnected data files, organized following a relational (SQL) database structure:
metadata.csv: This file contains the metadata information about each publication extracted from the data sources, including the following columns. All metadata information was directly extracted from full-text publications. We did not incorporate additional metadata from scholarly databases in this version but plan to do so in a future release of the dataset.
– id: The unique publication ID used in this dataset, such as “LLC-1.”
– title: The publication title.
– date: The publication year.
– authors: The authors of the publication. If a paper has more than one author, the format is “[author1]; [author2]”.
– affiliations: The affiliations of the publication. If a paper has more than one affiliation, the format is “[affiliation1]; [affiliation2]”.
– doi: The DOI of the publication, if available.
– journal: The journal name of the publication.
graph.csv: Each row of the file represents a unique graph presented in a publication.
– id: The unique publication ID used in this dataset.
– image id: The unique ID of each graph.
– caption: The caption of the graph.
text.csv: Each row of the file represents a unique mention of a graph in a publication.
– image id: The unique ID of the mentioned graph.
– image text: The sentence in which the graph is mentioned in the publication.
Format names and versions
All data files use the CSV format to ensure maximum accessibility. The files follow relational (SQL) database principles and can be easily imported into an SQL database management system, as sketched after Figure 1. The ER diagram of the database is illustrated in Figure 1.

Figure 1: ER Diagram of the Dataset.
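As an illustration of this relational structure, the sketch below loads the three CSV files into SQLite. The schema is inferred from the column descriptions above rather than shipped with the dataset, and text.csv is loaded into a table named text_mentions to avoid the SQL type keyword.

import csv
import sqlite3

conn = sqlite3.connect("dh_visualization.db")
conn.executescript("""
CREATE TABLE metadata (id TEXT PRIMARY KEY, title TEXT, date TEXT,
                       authors TEXT, affiliations TEXT, doi TEXT, journal TEXT);
CREATE TABLE graph (id TEXT REFERENCES metadata(id),
                    "image id" TEXT PRIMARY KEY, caption TEXT);
CREATE TABLE text_mentions ("image id" TEXT REFERENCES graph("image id"),
                            "image text" TEXT);
""")

def load(table, path, n_cols):
    """Bulk-insert a CSV file (skipping its header row) into one table."""
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.reader(f))[1:]
    conn.executemany(f"INSERT INTO {table} VALUES ({','.join('?' * n_cols)})", rows)

load("metadata", "metadata.csv", 7)
load("graph", "graph.csv", 3)
load("text_mentions", "text.csv", 2)
conn.commit()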
Creation dates
2023-02 to 2025-03
Dataset creators
Dr. Rongqian Ma (Indiana University Bloomington): designing research and database structure
Dr. Kai Li (University of Tennessee, Knoxville): designing research and database structure; data cleaning
Ranvir Singh Virk (Indiana University Bloomington): data collection
Sagar Prabhu (Indiana University Bloomington): data collection
Dylan Gorman (University of Tennessee, Knoxville): data collection
Language
The dataset is in English.
License
The dataset is shared under the Creative Commons CC0 (CC Zero) license.
Publication date
The dataset was published on 2025-04-01.
4 Reuse Potential
This dataset was collected and released to support ongoing and future research on data visualization in DH scholarship and, more broadly, in the humanities. Visual representations can serve as a critical lens for understanding the landscape of DH and its relationship with data science and humanities research, as such representations become an increasingly important mode of inquiry and communication in DH (Ma & Li, 2022; Münster & Terras, 2019). While previous research has explored this topic using case studies (Jänicke et al., 2015; Ma et al., 2021) or small-scale samples (Ma & Li, 2022), there remains no comprehensive and/or representative dataset dedicated to visual representations in DH. This gap reflects the overall difficulty of indexing DH publications in regular research databases, due to the DH field’s interdisciplinary nature and its strong roots in the humanities (Li et al., 2024; Spinaci et al., 2022).
Against this backdrop, we envision this novel dataset as a valuable resource for DH and scholarly communication researchers interested in understanding the role of visual representations in academic discourse. Specifically, the dataset can be used to investigate questions such as: (1) How are different types of visual representations used in DH publications? (2) How does the use of such visuals evolve across time, and vary according to disciplinary traditions or journal conventions and standards? (3) What rhetorical functions do visuals serve in DH publications? For instance, what is the relationship between the complexity of visualizations and their explanatory captions or in-text descriptions, if any? What are the stylistic and structural patterns in the captions and surrounding texts of visuals? (4) Does the use of visuals relate to article impact (e.g., article citations)? We hope this dataset can be further linked to other datasets, particularly enriched publication metadata from sources like OpenAlex and author-level information from ORCID, to demonstrate broader value beyond the questions outlined above. This critical research agenda will also be foundational to future interdisciplinary collaboration between researchers in humanities (and DH), science and technology studies, and computational fields, as visual communication is a shared interest in all these communities (Ma & Li, 2022).
We also anticipate that this dataset will serve as a valuable teaching resource in DH, the broader humanities, and information sciences. It can support students in: (1) learning about visualization techniques, (2) establishing an understanding of how to design and present visualizations in the context of DH scholarship, and (3) practicing data analysis skills using this well-curated domain-specific dataset. The authors plan to incorporate the dataset into courses they teach at their respective institutions, including INSC 592 (Introduction to Data Analytics) and 590 (Introduction to Data Visualization) at the University of Tennessee, Knoxville, and ILS-Z637 (Information Visualization) at Indiana University Bloomington.
We acknowledge that our dataset is based on a limited sample, both in terms of the selection method and the venues included. First, we focused on DH journals that are well funded and have international readership, and did not include smaller, regional, or domain-specific ones. Second, not all DH-relevant publications appear in the “exclusively DH journals,” as noted in our previous findings (Li et al., 2024). Third, journals represent only one of the many venues where DH researchers disseminate their work. Books and conference proceedings, for example, are not covered by our sampling approach, although both play a major role in the DH field. For these reasons, we do not claim that our dataset provides a comprehensive representation of the entire DH community. Rather, it should be understood as a focused subsample of journal publications within this community. While we believe this subsample constitutes a meaningful starting point for building a high-quality dataset of DH visual representations, we also recognize its inherent biases. The current dataset may underrepresent work published in smaller or discipline-specific journals (e.g., The Journal of the Text Encoding Initiative) and skew toward visual practices more common among DH scholars with STEM affiliations, who are more likely, for instance, to publish in journals than in monographs.
To address the above limitations, a key step is to expand the dataset to offer a more comprehensive view of DH visual representations. In future work, we plan to include more domain-specific journals, such as The Journal of the Text Encoding Initiative and the Journal of Cultural Analytics, to help mitigate existing biases and enhance the representativeness of the dataset. We also intend to include conference publications indexed by the Index of DH Conferences (retrieved 2025-06-04 from https://dh-abstracts.library.virginia.edu/), which will support cross-corpus comparisons and broaden the scope of the dataset. Additionally, we plan to enhance the dataset with richer metadata. For example, we plan to perform basic descriptive text analysis of the full-text graph discussions, such as keyword extraction and topic modeling (see the sketch below), to add keyword and topic metadata to the dataset, making it more accessible and useful for users with varying levels of technical expertise. We also plan to develop an interactive dashboard that showcases the dataset, enabling users to explore it from multiple perspectives in real time.
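As one example of the kind of descriptive analysis planned, the sketch below extracts TF-IDF keywords from figure captions using scikit-learn; this illustrates the general approach and is not the authors' planned implementation.

import csv
from sklearn.feature_extraction.text import TfidfVectorizer

with open("graph.csv", newline="", encoding="utf-8") as f:
    captions = [row["caption"] for row in csv.DictReader(f)]

vectorizer = TfidfVectorizer(stop_words="english", max_features=1000)
tfidf = vectorizer.fit_transform(captions)
terms = vectorizer.get_feature_names_out()

# Top five keywords for the first caption, ranked by TF-IDF weight.
weights = tfidf[0].toarray().ravel()
top = sorted(zip(terms, weights), key=lambda t: -t[1])[:5]
print([term for term, weight in top if weight > 0])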
Acknowledgements
The authors would like to thank Ranvir Singh Virk, Sagar Prabhu, and Dylan Gorman for assisting with data collection for this project.
Competing interests
The authors have no competing interests to declare.
Author roles
Rongqian Ma: Conceptualization, Resources, Methodology, Investigation, Funding Acquisition, Writing – original draft, Writing – review & editing, Supervision, Project administration; Kai Li: Conceptualization, Resources, Methodology, Investigation, Software, Formal Analysis, Funding Acquisition, Writing – original draft, Writing – review & editing; Ranvir Singh Virk: Methodology, Investigation, Software, Writing – original draft.
