PicAxe: Extracting Figures from Structurally and Syntactically Heterogeneous Corpora of PDF Files

Open Access | Dec 2025

Abstract

PicAxe is open-source Python software that researchers can use to extract figures from corpora of PDF files that contain text and images. It is designed for corpora that include both scanned and “born-digital” PDF files (structurally heterogeneous) of documents from different cultures and time periods (syntactically heterogeneous). In this paper we describe PicAxe's functionality and demonstrate it on two corpora. The first corpus contains scanned documents that represent the development of the “microbial biofilm” concept from 1929 to 1974. The second corpus contains “born-digital” documents from the academic journal Anthropocene from 2014 to 2023.

DOI: https://doi.org/10.5334/jors.574 | Journal eISSN: 2049-9647
Language: English
Submitted on: Apr 28, 2025
Accepted on: Dec 1, 2025
Published on: Dec 16, 2025
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2025 Anna C. Guerrero, Krishna Kamath, Qilin Zhou, Bruno Felalaga, Julia Damerow, Aaron R. Dinner, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.