Abstract
PicAxe is open-source Python software that researchers can use to extract figures from corpora of PDF files containing text and images. It is designed for corpora that are structurally heterogeneous (including both scanned and “born-digital” PDF files) and syntactically heterogeneous (including documents from different cultures and time periods). In this paper we describe the functionality of PicAxe and demonstrate it on two corpora. The first corpus contains scanned documents that represent the development of the “microbial biofilm” concept from 1929 to 1974. The second corpus contains “born-digital” documents from the academic journal Anthropocene from 2014 to 2023.
