(1) Overview
Introduction
Researchers in all disciplines are increasingly using computational tools to define the features of images and to find patterns common to groups of images [1, 2, 3, 4, 5, 6]. One reason for this trend is the large and growing amount of digital image data available on the internet, including webpages, social media, and digital libraries and archives. Another reason is increased interest in and accessibility to machine learning tools for image description and analysis. Researchers seeking to analyze images with computers need tools to collect and organize digital images and relevant metadata.
Many of the images that researchers want to study exist as figures in digital PDF files of documents that contain both text and figures, including newspapers and magazines, letters, academic journal articles, and textbooks. Researchers can manually extract figures from PDF files with cropping software, but doing so for more than a few hundred figures is prohibitively time consuming. Moreover, mislabeling and metadata errors are more common during manual extraction. For these reasons, automated figure extraction is desirable.
Automated figure extraction is not a straightforward process, especially if researchers want to extract figures from a large corpus of PDF files that are structurally and syntactically heterogeneous with respect to figures.
Structural heterogeneity: A corpus of PDF files can be structurally heterogeneous with respect to figures depending on how figures are embedded within individual PDF file structures. In many cases, older documents exist as digital scans or photographs of hard-copies. Each page of these documents is often encoded as a single image within the PDF file structure; text and figures are not embedded separately. For these cases, existing free, automated image extraction tools output each page as an image; such results are not useful for figure analysis (Figure 1, Case 1). Newer documents are frequently generated and disseminated as “born-digital” PDF files in which figures are embedded separately from text within the PDF syntax as XObjects. Automatically extracting individual XObjects from these PDF files is straightforward, and existing software can successfully perform this task (Figure 1, Case 2). However, there are cases in which single figures are inconspicuously embedded as multiple XObjects such that existing tools extract single figures in fragments. These fragmented outputs generated by existing free tools are not suitable for researchers who want to extract and analyze whole figures (Figure 1, Case 3). A structurally heterogeneous corpus will include PDFs with all three cases.

Figure 1
Current free automated figure extractors will perform poorly on structurally heterogeneous corpora: Figure extraction results from PyMuPDF [7]. Case 1: Individual figures are not embedded within the PDF structure; Case 2: Whole figures are embedded within PDF structure; Case 3: Incomplete parts of single figures are embedded within PDF structure.
Syntactic heterogeneity: A corpus of PDF files can be syntactically heterogeneous with respect to figures when documents from different types of sources, cultures, and time periods use variable syntax for document content, including layout, text, and figures. Computer scientists are actively improving document layout and figure detection in text-image environments using machine learning [8, 9, 10]. However, these tools are generally trained on corpora of PDF files with syntactically homogeneous content (e.g., the layout, text, and figures in PDF files of 21st century born-digital articles from scientific journals have a more similar syntax compared to those of scanned documents published before the 21st century). A syntactically heterogeneous corpus will include PDFs with a greater variety of content syntax. It is not clear that figure extraction algorithms trained on corpora with syntactically homogeneous content work for large corpora with syntactically heterogeneous content. Training an algorithm to identify figures is more difficult if there are greater varieties of layout, text, and figure syntax across documents. Additionally, age and method of digital preservation can affect text and figure data through warping, tilting, reducing contrast, or introducing aberrations; these issues represent additional cases for an algorithm to manage. If researchers want to extract figures from a corpus of PDF files with syntactically heterogeneous content, it is more likely that their corpus will include cases that an algorithm has never encountered, increasing the risk of false positive or false negative extraction results.
Here, we present PicAxe, open-source Python software for automated figure extraction from structurally and syntactically heterogeneous PDF corpora. Specifically, we present two pipelines: PicAxe-YOLO and PicAxe-OCR. PicAxe-YOLO performs figure recognition and extraction with two pretrained YOLOv8 [11] models. PicAxe-OCR identifies and eliminates text using PaddleOCR [12] and then extracts the remaining content.
In Section 1.5, we present and discuss performance results for PicAxe-YOLO and PicAxe-OCR for two corpora. The first corpus, the Biofilm corpus, includes 308 scanned PDFs of scholarly publications originally published from 1929 to 1974. This corpus, which consists of publications from many different journals and books, contains figures like hand-drawn line graphs and diagrams, tables generated with typewriters, and photographs made with cameras and microscopes; of the two corpora, the Biofilm corpus has greater syntactic heterogeneity. The second corpus, the Anthropocene corpus, includes 100 born-digital PDFs published in the scholarly journal Anthropocene from 2014 to 2023. This corpus contains documents with homogeneous layouts, and includes figures like computer generated diagrams, graphs, maps, tables, and photographs, all embedded as vector graphics.
Both PicAxe pipelines and their documentation are available on GitHub [13]. To make PicAxe accessible to researchers who lack the time or programming skills necessary to understand and implement figure extraction code, we will also integrate PicAxe into the Giles Ecosystem, a distributed web application that performs data extraction from PDF files [14]. Researchers using PicAxe through Giles will be able to upload their PDF corpus and download extracted figures without seeing code.
Functionality
At a high level, PicAxe identifies figures within text-image PDF files and returns the figures as PNG files. Users may choose between two PicAxe pipelines, PicAxe-YOLO and PicAxe-OCR. Both pipelines consist of multiple stages to enhance the accuracy of extraction results (Figure 2). Images are returned at a resolution of 72 pixels per inch by default. The process by which each pipeline identifies figures is different; the details of stages unique to each pipeline are covered in Section 1.3 of this article.

Figure 2
PicAxe visual overview.
With only two main stages, PicAxe-YOLO is a good choice for researchers who want an extraction process with a shorter overall run-time. PicAxe-YOLO also has an optional post-processing stage that removes some false positive results. PicAxe-YOLO has three operational modes: figure-sensitive, table-sensitive, and combined. These modes were added because tables were a primary cause of extraction error in both pipelines. Users can select any of these modes to optimize extraction results, depending on whether their focus is on figures, tables, or both.
PicAxe-OCR was developed to investigate whether accurate unsupervised figure extraction is possible without image training data beyond pretrained optical character recognition (OCR). PicAxe-OCR takes longer than PicAxe-YOLO to run on a corpus of the same size. For corpora where PicAxe-YOLO produces inaccurate results, researchers might try PicAxe-OCR, as it does not rely on image training data.
Implementation and architecture
PicAxe starts by converting each page of an input PDF file to a PNG file. This step reduces the amount of variation in PDF syntax, helping ensure that PicAxe can handle a corpus that includes both born-digital and scanned PDF files in which figures or parts of figures are encoded as XObjects, as well as PDF files of scanned documents that do not have separately embedded figures.
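For orientation, the sketch below shows one way such a page-to-PNG conversion can be done with the pdf2image library, which PicAxe-OCR uses for this step; the rendering resolution and file-naming scheme shown here are illustrative assumptions, not necessarily PicAxe's exact settings.

```python
# A minimal sketch of the page-to-PNG conversion using pdf2image; the dpi
# value and naming scheme are illustrative, not PicAxe's exact defaults.
from pathlib import Path
from pdf2image import convert_from_path

def pdf_pages_to_pngs(pdf_path: str, out_dir: str, dpi: int = 72) -> list[Path]:
    """Render every page of a PDF as a PNG file and return the file paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    pages = convert_from_path(pdf_path, dpi=dpi)  # one PIL.Image per page
    paths = []
    for i, page in enumerate(pages, start=1):
        path = out / f"{Path(pdf_path).stem}_page{i:03d}.png"
        page.save(path, "PNG")
        paths.append(path)
    return paths
```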
Next, PicAxe removes any borders generated during hard-copy document scanning from the PNG files of the PDF pages since these extraneous borders impede accurate figure extraction in both pipelines. Both PicAxe pipelines use these pre-processed PNG files as input and extract figures from them using one of two user-selected methods:
PicAxe-YOLO uses a pretrained object segmentation model together with fine-tuned object detection models to identify and extract tables first and then the remaining figures, including images like illustrations, photographs, and graphs.
PicAxe-OCR uses PaddleOCR to detect and remove text. It implements a confidence threshold so that only text elements exceeding this threshold are removed. Then, PicAxe-OCR applies a combination of Otsu’s method [15] for thresholding and Sobel edge detection [16], followed by multiple dilation iterations, to identify bounding boxes of the remaining marks. Boundaries with smaller areas are filtered out, making the PicAxe-OCR approach noise-resistant. Finally, PicAxe-OCR uses these boundaries to crop the desired pixels from the original PNG files.
We now describe each method in more detail.
PicAxe-YOLO
Stage 0: Model Training. We adopted a medium-sized YOLOv8 model, comprising approximately 26 million parameters, to develop specialized figure-sensitive and table-sensitive models for document analysis. The figure-sensitive YOLOv8 model was initially trained on the ScanBank dataset, which consists of 8,146 images [17] and an additional 3,670 geoscience images [18], following an 80/20 train/validation split. This model achieved a precision of 0.872, a recall of 0.904, and an mAP50 of 0.938, demonstrating robust performance in figure identification within the training dataset.
To develop a table-sensitive model, we utilized the D4 LA dataset [19], which contains 11,093 images and was also divided using an 80/20 train/validation split. To enhance the accuracy and efficiency of the table-sensitive model, we employed a transfer learning approach by fine-tuning the backbone weights of the figure-sensitive YOLOv8 model on the D4 LA dataset. This strategy enabled the table-sensitive model to leverage the pre-existing knowledge from the figure detection task, resulting in a precision of 0.898, a recall of 0.861, and an mAP50 of 0.920.
There were inherent dataset augmentations present in both training datasets, including scanning noise, perspective transformations, additional saturation, and mosaic effects, which should improve PicAxe-YOLO’s performance on syntactically heterogeneous corpora. The integration of the figure-sensitive and table-sensitive models aimed to maximize the recall probability for figures compared to tables.
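For readers who want to set up a comparable training run, the sketch below shows how such models can be trained and fine-tuned with the ultralytics package; the dataset YAML files, epoch counts, and image size are placeholders, not the exact settings used to train the PicAxe-YOLO models.

```python
# Illustrative training setup using the ultralytics API; dataset configs and
# hyperparameters are placeholders, not PicAxe's exact training settings.
from ultralytics import YOLO

# Figure-sensitive model: start from the medium YOLOv8 checkpoint (~26M parameters).
figure_model = YOLO("yolov8m.pt")
figure_model.train(data="scanbank_plus_geoscience.yaml",  # hypothetical dataset config
                   epochs=100, imgsz=640)

# Table-sensitive model: fine-tune from the figure-sensitive weights
# (transfer learning on the D4 LA layout dataset).
table_model = YOLO("runs/detect/train/weights/best.pt")
table_model.train(data="d4la_tables.yaml", epochs=100, imgsz=640)  # hypothetical config
```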
Stage 1: Page Segmentation. Given pre-processed PNGs of pages, PicAxe-YOLO applies a content segmentation module so users can isolate content of interest before applying the object detection models in Stage 2. The content segmentation module leverages Eynollah [20] with pretrained weights optimized for both full region and table segmentation, facilitating the removal of artifacts that can interfere with accurate table detection. Initially, a general layout model is employed to delineate the primary contours of the pages, followed by table-specific segmentation to mitigate the risk of excluding relevant tables and to prevent excessive cropping. Since Eynollah’s segmentation can overcrop and remove desired content, PicAxe retains segmented regions that cover at least 65% of the original image area.
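As a concrete illustration of this retention rule, the following sketch (with illustrative names) keeps an Eynollah-segmented region only if it covers at least 65% of the original page area; otherwise the full page is used.

```python
# Sketch of the 65% area-retention rule described above; names are illustrative.
def keep_segmented_region(region_box, page_width, page_height, min_ratio=0.65):
    """Return True if the segmented region preserves enough of the page area."""
    x0, y0, x1, y1 = region_box
    region_area = max(0, x1 - x0) * max(0, y1 - y0)
    return region_area >= min_ratio * (page_width * page_height)
```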
Stage 2: Border Removal and Object Detection. After scan borders are removed using a mean pixel density thresholding technique, the second stage utilizes two pretrained YOLOv8 models that operate concurrently to detect figures and tables. The bounding boxes detected by the models are dilated and merged to create a unified output when connected. These YOLOv8 models are specifically trained to detect a diversity of figures and tables; one model is optimized for greater sensitivity to tables, while the other is fine-tuned to improve sensitivity to figures.
The motivation for employing two YOLOv8 models stems from the need to address challenges presented by existing annotated datasets. The ScanBank dataset (for figures) and D4 LA dataset (for tables) were built to address challenges associated with scanned PDF documents of poor quality, such as random affine rotations, salt-and-pepper noise, and perspective distortion, making them suitable for training object detection models for scanned documents. However, discrepancies exist in how these datasets label certain table-like sections. For example, ScanBank labels document sections such as “Table of Contents” as images, whereas D4 LA labels these sections as “RegionList.” To mitigate this noisy label problem, the geoscience dataset, which contains well-defined and non-ambiguous figure labels, was integrated with ScanBank to enhance sensitivity to figure diversity. This approach aligns with the Enhanced Meta Label Correction (EMLC) framework [21], which leverages a small, clean auxiliary dataset to correct noisy labels and improve model generalization.
Stage 3 (optional): Post-processing. Users can choose to employ post-processing to filter out false positive figures or tables, including barcodes, columnar texts, equations, and images with irregular aspect ratios. This process utilizes Pyzbar [22] for barcode detection and removal, along with contour area and aspect ratio restrictions.
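A minimal sketch of the barcode check is shown below, assuming pyzbar and Pillow are installed; the file names are hypothetical, and the contour-area and aspect-ratio restrictions mentioned above are omitted for brevity.

```python
# Sketch of the optional barcode filter with pyzbar; file names are hypothetical.
from PIL import Image
from pyzbar.pyzbar import decode

def is_barcode(png_path: str) -> bool:
    """Return True if a barcode or QR code is detected in the extracted image."""
    return len(decode(Image.open(png_path))) > 0

candidates = ["page12_fig1.png", "page12_fig2.png"]  # hypothetical extraction outputs
kept = [p for p in candidates if not is_barcode(p)]
```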
PicAxe-OCR
Stage 1: Conversion to PNG files and Preprocessing. After each page of an input PDF is converted into a high-resolution PNG file using the pdf2image library [23], PicAxe-OCR crops out scan borders that surround document content; this step removes dark page edges in scanned files. Cropping is achieved through a series of image processing techniques, including thresholding, contour detection, and morphological operations like dilation and erosion. These initial steps improve the quality of subsequent operations.
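The sketch below illustrates one way such a border crop can be implemented with OpenCV (Otsu thresholding, morphological closing, and contour detection); the kernel size and exact sequence of operations in PicAxe-OCR may differ.

```python
# Illustrative border crop: binarize, close gaps, crop to the largest bright region.
import cv2
import numpy as np

def crop_scan_border(png_path: str) -> np.ndarray:
    img = cv2.imread(png_path, cv2.IMREAD_GRAYSCALE)
    # Bright page area becomes white; dark scan borders become black.
    _, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # Close gaps so the page surface forms one connected bright region.
    kernel = np.ones((15, 15), np.uint8)
    binary = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel)
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return img
    # Crop to the largest bright region, i.e., the page without its dark border.
    x, y, w, h = cv2.boundingRect(max(contours, key=cv2.contourArea))
    return img[y:y + h, x:x + w]
```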
Stage 2: Table Detection and Removal. Once the borders are cropped, PicAxe-OCR uses LayoutParser [24] to detect tables and tabular figures. Tables and tabular figures are removed by replacing their bounding areas with a white background. Extracted tables and tabular figures are delivered as output with extracted figures in post-processing.
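The sketch below gives a hedged illustration of this stage, assuming a PubLayNet-trained LayoutParser model; the model weights, confidence threshold, and file names that PicAxe-OCR actually uses may differ.

```python
# Hedged sketch of table detection and white-out with LayoutParser; the model
# configuration and threshold are assumptions, not PicAxe-OCR's exact choices.
import cv2
import layoutparser as lp

model = lp.Detectron2LayoutModel(
    "lp://PubLayNet/faster_rcnn_R_50_FPN_3x/config",
    extra_config=["MODEL.ROI_HEADS.SCORE_THRESH_TEST", 0.7],
    label_map={0: "Text", 1: "Title", 2: "List", 3: "Table", 4: "Figure"},
)

page = cv2.imread("page_003.png")  # hypothetical pre-processed page
for block in model.detect(page):
    if block.type == "Table":
        x1, y1, x2, y2 = map(int, block.coordinates)
        cv2.imwrite(f"table_{y1}_{x1}.png", page[y1:y2, x1:x2])  # deliver the table
        page[y1:y2, x1:x2] = 255                                 # remove it from the page
```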
Stage 3: Text Detection and Removal Using PaddleOCR. With tables and tabular figures removed, PicAxe-OCR detects and removes remaining text elements using PaddleOCR, an optical character recognition (OCR) tool. We apply a confidence threshold to ensure that only text detected with a high confidence level is removed.
Text detection is applied until the model cannot detect any more text or the number of iterations reaches the maximum number of passes (five by default), which can be customized by the user. Multiple OCR passes usually improve figure extraction results because removing text in one pass can make the remaining text easier to detect in the next. The majority of execution time is spent on this stage.
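The sketch below illustrates this iterative loop using PaddleOCR's Python API; the confidence threshold value and helper names are illustrative, while the default of five passes mirrors the description above.

```python
# Sketch of the iterative text-removal loop; the 0.8 threshold is an assumption.
import cv2
from paddleocr import PaddleOCR

ocr = PaddleOCR(lang="en")

def remove_text(page, conf_threshold=0.8, max_passes=5):
    """Whiten detected text boxes until no text remains or max_passes is reached."""
    for _ in range(max_passes):
        result = ocr.ocr(page)
        lines = result[0] if result else None
        if not lines:
            break  # nothing left to detect; stop early
        for box, (_, confidence) in lines:
            if confidence >= conf_threshold:
                xs = [int(p[0]) for p in box]
                ys = [int(p[1]) for p in box]
                page[min(ys):max(ys), min(xs):max(xs)] = 255  # white out the text box
    return page

cleaned = remove_text(cv2.imread("page_003_no_tables.png"))  # hypothetical input
```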
Stage 4: Select Target Images. PicAxe-OCR further processes the text-free PNGs to identify target image regions. First, it applies Sobel convolutions in different directions to highlight potential regions of interest based on rapid changes in pixel intensity; such regions generally correspond to the edges of an image [16]. Those regions are combined to form single images with highlighted edges.
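For illustration, the Sobel step can be expressed with OpenCV roughly as follows; the kernel size and input file name are assumptions.

```python
# Horizontal and vertical Sobel gradients combined into one edge-magnitude image.
import cv2

gray = cv2.imread("page_003_no_text.png", cv2.IMREAD_GRAYSCALE)  # hypothetical input
grad_x = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3)  # intensity changes along x
grad_y = cv2.Sobel(gray, cv2.CV_64F, 0, 1, ksize=3)  # intensity changes along y
edges = cv2.convertScaleAbs(cv2.magnitude(grad_x, grad_y))  # combined edge map
```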
Next, it applies morphological operations important for extracting faint images with broken edges. These operations include erosion, which enlarges dark marks to connect broken edges; Canny edge detection to identify all edges; and another erosion round to connect edges further. Contours are detected and filtered by size to remove small, unconnected marks; these marks are usually noise and not part of a larger target object, as prior steps could not connect them to larger regions.
Then, the remaining contours are converted to bounding boxes and merged by the degree of overlap. A very low intersection-over-union threshold is used to ensure that bounding boxes with relatively small overlaps are collected together. This step is essential for handling cases where an image is divided across multiple regions, as in the case of subfigures or multipart images.
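The sketch below illustrates this grouping logic under stated assumptions: boxes are given as (x0, y0, x1, y1) tuples, and any pair whose intersection-over-union exceeds a deliberately low threshold is merged into one box; the threshold value and function names are illustrative.

```python
# Sketch of low-IoU bounding-box merging; the threshold value is illustrative.
def iou(a, b):
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    ix0, iy0 = max(ax0, bx0), max(ay0, by0)
    ix1, iy1 = min(ax1, bx1), min(ay1, by1)
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union else 0.0

def merge_boxes(boxes, iou_threshold=0.05):
    """Repeatedly merge any pair of boxes whose overlap exceeds the threshold."""
    boxes = list(boxes)
    merged = True
    while merged:
        merged = False
        for i in range(len(boxes)):
            for j in range(i + 1, len(boxes)):
                if iou(boxes[i], boxes[j]) > iou_threshold:
                    a, b = boxes[i], boxes[j]
                    boxes[i] = (min(a[0], b[0]), min(a[1], b[1]),
                                max(a[2], b[2]), max(a[3], b[3]))
                    del boxes[j]
                    merged = True
                    break
            if merged:
                break
    return boxes
```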
Last, the final bounding boxes are used to extract potential target figures from the original PDF image. These target figures are placed on a white mask so that when they are passed to Stage 5, all marks of interest (even those that were removed for size) are included in the final extracted figures.
Stage 5: Final Figure Selection and Post-processing. In this final stage, PicAxe-OCR reapplies operations from Stages 3 and 4 with modified parameters to ensure only relevant figures are extracted and saved. Another round of text detection and removal is conducted, followed by morphological operations like erosion, Canny edge detection, and dilation. In this stage, dilation uses a larger number of iterations to ensure that regions of interest are larger and connected so they can be extracted together. Smaller dilated regions are filtered out. Contours are converted to bounding boxes and extracted. Final extracted figures are padded slightly and saved with descriptive filenames. Processed files are logged to prevent redundancy and facilitate tracking.
Quality control
For both PicAxe pipelines, we performed functional testing, load testing, and cross-platform testing using Python 3.10 for PicAxe-YOLO and Python 3.10.12 for PicAxe-OCR.
Given the complexity of figure recognition and extraction tasks, we recommend that users always manually verify and clean their PicAxe extraction results. Incomplete true positive, false positive, and false negative results are possible. These results are discussed in more detail in Section 1.5.
Performance
We scored the performances of (1) PicAxe-YOLO (combined mode), (2) PicAxe-OCR (Paddle), and (3) an earlier version of PicAxe-OCR that implemented pytesseract [25] for OCR tasks against manual extraction results for two corpora: the Biofilm corpus and the Anthropocene corpus.
The Biofilm corpus includes 308 scanned PDFs of scholarly documents originally produced and published from 1929 to 1974. Compared to the Anthropocene corpus, the Biofilm corpus is syntactically heterogeneous: the documents in this corpus are of variable type (book chapters, dissertations, and journal articles) and were produced over a broader time range, and thus contain many different document layouts and figure styles compared to the Anthropocene corpus. Figures include cases of hand-drawn line graphs and diagrams, tables generated with typewriters, and photographs made with film cameras and microscopes.
The Anthropocene corpus includes a random sample of 100 born-digital PDFs published in the scholarly journal Anthropocene from 2014 to 2023. Compared to the Biofilm corpus, the Anthropocene corpus is syntactically homogeneous: document layout is consistent across PDFs as the journal did not modify their document layout significantly over the ten years. Figures include cases of computer generated vector graphics of diagrams, graphs, maps, and tables, as well as digital photographs.
We measured figure extraction performance for each PicAxe version. First, we measured corpus-level extraction performance by calculating the percent difference in total number of images extracted by PicAxe compared with manual extraction (Table 1). To measure image-level extraction performance, we visually compared the images that PicAxe extracted to manually-extracted ground-truth figures; we identified true positives, false positives, and false negatives (Table 2). We further split true positives into two categories: complete and incomplete, and false positives into two categories: acceptable and unacceptable, to give a more detailed description of the kinds of extraction results PicAxe may produce (Figure 3). We used image-level counts to calculate precision, recall, and F1 statistics for both corpora (Table 3).
Table 1
Total images extracted: Corpus-level extraction performance evaluated by total number of images extracted (# images) compared to total number of images extracted during manual extraction (% change).
|  | BIOFILM |  | ANTHROPOCENE |  |
|---|---|---|---|---|
|  | # IMAGES | % CHANGE | # IMAGES | % CHANGE |
| Manual | 2748 | – | 842 | – |
| PicAxe-OCR (pytesseract) | 3126 | +13.76% | 962 | +14.25% |
| PicAxe-OCR (Paddle) | 1773 | –35.48% | 1273 | +51.19% |
| PicAxe-YOLO | 2364 | –13.97% | 1364 | +61.64% |
Table 2
Positives and negatives: Image-level extraction performance evaluated by number of images extracted compared to manual extraction. Complete true positives (TPc), incomplete true positives (TPi), acceptable false positives (FPa), unacceptable false positives (FPu), and false negatives (FN).
| BIOFILM | TPc | TPi | FPa | FPu | FN |
|---|---|---|---|---|---|
| Manual | 2748 | – | – | – | – |
| PicAxe-OCR (pytesseract) | 1626 | 19 | 111 | 1370 | 561 |
| PicAxe-OCR (Paddle) | 1567 | 152 | 26 | 28 | 631 |
| PicAxe-YOLO | 2136 | 11 | 75 | 142 | 97 |

| ANTHROPOCENE | TPc | TPi | FPa | FPu | FN |
|---|---|---|---|---|---|
| Manual | 842 | – | – | – | – |
| PicAxe-OCR (pytesseract) | 697 | 8 | 3 | 254 | 255 |
| PicAxe-OCR (Paddle) | 794 | 142 | 1 | 427 | 41 |
| PicAxe-YOLO | 1014 | 29 | 200 | 30 | 62 |
Table 3
Precision, recall, F1: Image-level extraction performance evaluated by precision, recall, and F1 statistics calculated from the true positive, false positive, and false negative counts in Table 2.
|  | BIOFILM |  |  | ANTHROPOCENE |  |  |
|---|---|---|---|---|---|---|
|  | PRECISION | RECALL | F1 | PRECISION | RECALL | F1 |
| PicAxe-OCR (pytesseract) | 0.5262 | 0.4462 | 0.4829 | 0.8044 | 0.6357 | 0.7102 |
| PicAxe-OCR (Paddle) | 0.9695 | 0.7151 | 0.8231 | 0.6862 | 0.6662 | 0.6760 |
| PicAxe-YOLO | 0.9082 | 0.8724 | 0.8899 | 0.8193 | 0.7813 | 0.7998 |

Figure 3
PicAxe extraction result types: PicAxe can produce true positive results (complete and incomplete), false positive results (acceptable and unacceptable), and false negative results.
Results
True positive (complete): These outputs represent instances in which PicAxe extracted entire figures, multiple figures on a single page (for example, two photographs on a single PDF page that were considered separate figures but that PicAxe extracted together), or whole sections of multi-part figures (for example, a single figure containing multiple photographs from which PicAxe extracted individual photographs as separate images) (Figure 3, Result Type 1). If PicAxe extracted multiple sections of a multi-part figure but missed a section, that missed section was marked as a false negative. Even though both PicAxe pipelines extracted fewer total images than a manual extractor for the Biofilm corpus (Table 1), they extracted a majority of the ground-truth figures. For the Anthropocene corpus, PicAxe-YOLO extracted more true positives than a manual extractor because there were many instances in which it extracted multiple separate images that belonged to a single figure, whereas PicAxe-OCR extracted more false positives (Table 2). The spacing of figure components and figure parts affects the performance of both pipelines, with greater space between components increasing the likelihood that components are extracted separately.
For both corpora, PicAxe-YOLO performed better overall than both PicAxe-OCR (pytesseract) and PicAxe-OCR (Paddle), achieving a higher F1 score (Table 3).
True positive (incomplete): These outputs represent instances in which PicAxe extracted incomplete portions of figures (Figure 3, Result Type 2). Of the three PicAxe versions tested, PicAxe-OCR (Paddle) was most prone to extracting incomplete portions of images, particularly from figures like graphs or diagrams whose linear components were widely spaced, such that the PicAxe-OCR (Paddle) bounding function did not extract these components together and eliminated components with small aspect ratios (Table 2). PicAxe-OCR (Paddle)’s rules for aspect-ratio filtering and bounding can be customized based on user needs.
False positive (acceptable): Not all false positive outputs are unacceptable. Non-textual content like library or journal logos, barcodes, graphic shapes and lines meant to separate page content, and large scan aberrations are not eliminated by OCR, but they do not represent ground-truth figures (Figure 3, Result Type 3). Such false positive results must be cleaned manually before further image analysis. Acceptable false positive results (specifically the library and journal logo specific to Anthropocene) were present in every PDF in the Anthropocene corpus, and PicAxe-YOLO extracted them in every case. It is presently unclear why PicAxe-OCR (Paddle) did not extract these same acceptable false positives.
Another common acceptable false positive result is columnar texts that do not represent a desired figure (i.e., reference lists and indices). In terms of table extraction, the table-sensitive PicAxe-YOLO model demonstrated a reduced tendency to mistakenly identify columnar texts and equations as tables. However, some table-like items were still detected, suggesting that the YOLOv8 model could benefit from additional class definitions beyond single-class training, such as “column list” or “region list,” to enhance instance learning. Post-processing techniques, particularly those utilizing OCR, could be employed to remove cropped regions containing undesired textual elements like “contents” or “nomenclature” from the detected tables.
False positive (unacceptable): Some false positive outputs are unacceptable. PicAxe-OCR is more prone to this error because sometimes it collects text-like data (particularly mathematical symbols) alone or along with desired image data (Figure 3, Result Type 4). Compared to PicAxe-OCR (pytesseract), PicAxe-OCR (Paddle) achieved significant gains in performance for the Biofilm corpus as it greatly reduced the number of unacceptable false positives (Table 3) likely because of added preprocessing steps and bounding rules. For the Anthropocene corpus, on the other hand, PicAxe-OCR (Paddle) extracted far more unacceptable false positive results than the other two pipelines because PDFs in this corpus contained many symbolic text elements that were not eliminated via OCR, and thus extracted erroneously. We are working to remedy this particular type of false positive result, because without it, PicAxe-OCR (Paddle) would have outperformed PicAxe-YOLO for the Anthropocene corpus with an F1 of 0.9775. For the Anthropocene corpus, PicAxe-YOLO generated unacceptable false positives when it extracted a full figure, but then extracted a repeat fragment from that same figure, producing unwanted overlapping extraction results.
False negative: PicAxe occasionally missed images entirely (Figure 3, Result Type 5). Though PicAxe-OCR (Paddle) demonstrated a significant decrease in the number of false positives for the Biofilm corpus compared to PicAxe-OCR (pytesseract), PicAxe-OCR (Paddle) missed more images entirely (Table 2), suggesting that content filtering and cleaning, or strict bounding rules, resulted in figures being incorrectly eliminated. This result points to a trade-off between reducing unacceptable false positives and increasing true positives for the Biofilm corpus. This trade-off did not appear to apply for PicAxe-OCR (Paddle) for the Anthropocene corpus where components of single vector-graphic images are generally spaced more closely. For both corpora, PicAxe-YOLO was prone to missing one of two adjacent tables on a single PDF page, but it is possible that using table-sensitive mode would improve its performance.
Performance Data
Discussion
PicAxe overcomes some of the issues of existing open-source image extraction software so that researchers can use it to extract image data from PDF corpora quickly. Existing free software tools like pdfplumber and PyMuPDF depend on figures being embedded separately from text within the PDF structure, so they cannot reliably extract image data from corpora that include scanned PDFs [9]. Additionally, existing open-source image extraction software will not produce results useful for further image analysis when figures that appear to be embedded as single figures are actually embedded in parts, because those parts will be extracted instead of whole figures. By converting text and figure data to a single image format and then applying different figure detection techniques, PicAxe-OCR (Paddle) and PicAxe-YOLO can extract figures from corpora with heterogeneous PDF structure and figure syntax more reliably. Additionally, since PicAxe was developed as research software, researchers can apply PicAxe to their unique corpora without writing new code to manage files. We expect researchers looking to extract figures from corpora that contain both scanned and born-digital documents will find PicAxe useful.
We are working to improve PicAxe performance in several targeted areas:
Figure Spacing and Grouping: Spacing of figures and figure components greatly affects extraction results for all PicAxe pipelines. Even human annotators must decide how to group multiple images from a single document, especially when a single figure comprises multiple, separate images. Multiple extraction schemes may be valid depending on specific project requirements and user judgment. Even though PicAxe extracted the majority of images for both corpora, the extraction groupings might not be useful for further image analysis depending on a researcher’s questions and needs. We are testing different bounding rules for PicAxe-OCR (Paddle) to improve image grouping; however, it is clear that devising general grouping rules for a syntactically heterogeneous corpus will be more difficult, as the spacing of components within single and multi-part figures will vary more. PicAxe-YOLO could be trained to make grouping decisions closer to those of a manual extractor, depending on user needs. Recent advancements in YOLOv8-based distance estimation demonstrate how spatial relationships between detected objects can be better understood using Coordinate Attention and Wise-IoU [26]. Applying similar distance-aware techniques to PicAxe-YOLO could enhance figure grouping accuracy, ensuring that figures with meaningful proximity are grouped correctly.
Unwanted Non-textual Content: Common non-textual content that causes PicAxe extraction performance to decline includes warped or tilted text, dark borders introduced when scanning hard-copies, vertical and horizontal page break lines, and mathematical symbols. We are currently testing different preprocessing steps and improving PicAxe’s functions for removing scan borders and document layout lines prior to extraction, as well as recognizing and eliminating mathematical symbols. Prior research has proposed a shabby page pipeline designed to enhance robustness against scanned document degradation, border artifacts, and layout distortions [27]. Incorporating this pipeline into PicAxe’s preprocessing module could further improve document layout normalization, noise reduction, and figure extraction accuracy.
Dealing with a Syntactically Heterogeneous Corpus: Although PicAxe-YOLO performed better than PicAxe-OCR (Paddle) for both corpora, it is unclear whether it would maintain this advantage as more heterogeneous corpora are tested. Based on our results, we are developing a PicAxe pipeline that combines aspects of PicAxe-OCR and PicAxe-YOLO to improve figure extraction performance on syntactically heterogeneous corpora.
Conclusions
Researchers who seek to extract many figures from a structurally and syntactically heterogeneous corpus of PDFs should find PicAxe useful. PicAxe will not perform perfect extraction and users should always check the extraction results before running further analysis. Despite its limitations, even partial automation of figure extraction for large corpora will save researchers time and help them learn about the content of their corpora.
(2) Availability
Operating system
Windows, MacOS, Linux
PicAxe-YOLO
Programming language
Python 3.10
Dependencies
PicAxe-YOLO depends on NumPy, OpenCV-python, pdf2image, matplotlib, ultralytics, tensorflow, keras, pyzbar, and huggingface-hub. The specific versions of the dependencies can be found in the requirements file in the PicAxe repository.
Users should be aware that using different versions of these applications might result in slightly different image extraction results.
PicAxe-OCR
Programming language
Python 3.10.12
Additional system requirements
PicAxe-OCR requires a system with at least 8GB of RAM for processing large PDF files efficiently.
Dependencies
The core libraries of PicAxe-OCR include NumPy, OpenCV-python, opencv-python-headless, pandas, scikit-image, scipy, and Pillow. The OCR and image extraction libraries include PaddleOCR, PaddlePaddle, pytesseract, PyMuPDF, pdf2image, and pdfplumber. Additional utilities include jsonschema, matplotlib, requests, and beautifulsoup4. Logging and progress reporting use coloredlogs and tqdm.
The specific versions of the dependencies can be found in the requirements file in the PicAxe repository. Users should be aware that using different versions of these applications might result in slightly different image extraction results.
List of contributors
ACG and ARD conceptualized the project, supervised the research, and wrote the manuscript. KK, QZ, and BF developed the software and worked with ACG to test it. JD provided technical guidance and developed the interface to the Giles Ecosystem.
Software location
Archive
Name: Zenodo
Persistent identifier: https://doi.org/10.5281/zenodo.14873181
License: MIT
Publisher: Anna C. Guerrero, Krishna Kamath, Qilin Zhou, Bruno Felalaga, Julia Damerow, Aaron R. Dinner
Version published: v1.0.1
Date published: 06/10/25
Code repository
Name: GitHub
Identifier: https://github.com/acguerr1/PicAxe
License: MIT
Date published: 06/10/25
Language
Python 3.10
(3) Reuse potential
We designed PicAxe with the broadest possible audience in mind. We specifically consulted with and considered the needs of researchers from disciplines in the humanities who do not usually learn programming skills. PicAxe’s open-source code is structured and commented with a novice programmer in mind. For ease of installing and running the code, we also included a Docker image for each pipeline.
With our audience in mind, we designed PicAxe to handle large, structurally and syntactically heterogeneous corpora of text-image documents. Researchers can use PicAxe to automatically extract figures from corpora that include multiple types of documents (books, journal articles, letters, etc.), that were digitally preserved with multiple methods (scanned and born-digital), and that span multiple cultures or long time periods. For example, researchers are currently using PicAxe to extract figures from articles published by a single scientific journal from 1955 to 2025, a period over which publication format and figure-production technology changed dramatically. Researchers and programmers with more advanced coding skills may leverage both pipelines to fine-tune image extraction algorithms to specific corpora.
PicAxe continues to evolve. We are working to improve PicAxe’s overall performance by refining (1) the removal of unwanted non-textual content before figure extraction, and (2) the handling of spacing of remaining content after text removal. We are also testing combinations of different parts of PicAxe-YOLO and PicAxe-OCR to improve performance on maximally heterogeneous corpora. To contribute to the PicAxe source code, submit questions, or report problems, users can contact the author Anna Clemencia Guerrero directly or create issues on GitHub (https://github.com/acguerr1/PicAxe/issues).
PicAxe will be more widely accessible and institutionally maintained as both pipelines become integrated into The Giles Ecosystem, a distributed system based on Apache Kafka. The system components are implemented using Java and the Spring Framework, and are available under an open-source license on GitHub (http://github.com/diging/). The Giles Ecosystem is developed and maintained by the Digital Innovation Group at Arizona State University as part of the School of Complex Adaptive Systems. To report problems or submit questions, users can create issues in GitHub (https://github.com/diging/gileco-giles-web/issues) or contact the author Julia Damerow directly.
Acknowledgements
We thank Maria Guerrero for her help collecting data and metadata, and the Cyborg Cells collaboration for nucleating initial discussions. We are grateful to the developers of the many Python packages upon which PicAxe depends. We are also grateful to data annotators whose work makes models work. Finally, we thank the many researchers who created the documents used to train our models.
Competing interests
The authors have no competing interests to declare.
Author Contributions
Krishna Kamath, Qilin Zhou, and Bruno Felalaga made equal contributions.
