PicAxe: Extracting Figures from Structurally and Syntactically Heterogeneous Corpora of PDF Files

Open Access | Dec 2025

Figures & Tables

Figure 1

Current free automated figure extractors will perform poorly on structurally heterogeneous corpora: Figure extraction results from PyMuPDF [7]. Case 1: Individual figures are not embedded within the PDF structure; Case 2: Whole figures are embedded within PDF structure; Case 3: Incomplete parts of single figures are embedded within PDF structure.

Figure 2

PicAxe visual overview.

Table 1

Total images extracted: Corpus-level extraction performance evaluated by total number of images extracted (# images) compared to total number of images extracted during manual extraction (% change).

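| Method | Biofilm: # images | Biofilm: % change | Anthropocene: # images | Anthropocene: % change |
|---|---|---|---|---|
| Manual | 2748 | | 842 | |
| PicAxe OCR (pytesseract) | 3126 | +13.76% | 962 | +14.25% |
| PicAxe OCR (Paddle) | 1773 | –35.48% | 1273 | +51.19% |
| PicAxe YOLO | 2364 | –13.97% | 1364 | +61.64% |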
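
The % change column can be reproduced from the raw counts if it is read as the relative difference between an automated count and the manual count. The snippet below is a minimal sketch under that assumption; the function name `percent_change` is illustrative and not part of PicAxe.

```python
def percent_change(extracted: int, manual: int) -> float:
    """Relative difference of an automated count from the manual count, in percent."""
    if manual <= 0:
        raise ValueError("manual count must be positive")
    return 100.0 * (extracted - manual) / manual


# Biofilm counts as listed in Table 1 (manual extraction found 2748 images).
manual_biofilm = 2748
biofilm_counts = [
    ("PicAxe OCR (pytesseract)", 3126),
    ("PicAxe OCR (Paddle)", 1773),
    ("PicAxe YOLO", 2364),
]

for method, count in biofilm_counts:
    print(f"{method}: {percent_change(count, manual_biofilm):+.2f}%")
```

A positive value means the method produced more image files than manual extraction did, which on its own does not indicate better performance: the extra files may be false positives.
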
Table 2

Positives and negatives: Image-level extraction performance evaluated by number of images extracted compared to manual extraction. Complete true positives (TPc), incomplete true positives (TPi), acceptable false positives (FPa), unacceptable false positives (FPu), and false negatives (FN).

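Biofilm

| Method | TPc | TPi | FPa | FPu | FN |
|---|---|---|---|---|---|
| Manual | 2748 | | | | |
| PicAxe OCR (pytesseract) | 1626 | 19 | 111 | 1370 | 561 |
| PicAxe OCR (Paddle) | 1567 | 152 | 26 | 28 | 631 |
| PicAxe YOLO | 2136 | 11 | 75 | 142 | 97 |

Anthropocene

| Method | TPc | TPi | FPa | FPu | FN |
|---|---|---|---|---|---|
| Manual | 842 | | | | |
| PicAxe OCR (pytesseract) | 697 | 8 | 3 | 254 | 255 |
| PicAxe OCR (Paddle) | 794 | 142 | 1 | 427 | 41 |
| PicAxe YOLO | 1014 | 29 | 200 | 30 | 62 |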
Table 3

Precision, recall, F1: Corpus-level extraction performance computed from the total numbers of complete true positives, incomplete true positives, acceptable false positives, unacceptable false positives, and false negatives.

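| Method | Biofilm: Precision | Biofilm: Recall | Biofilm: F1 | Anthropocene: Precision | Anthropocene: Recall | Anthropocene: F1 |
|---|---|---|---|---|---|---|
| PicAxe OCR (pytesseract) | 0.5262 | 0.4462 | 0.4829 | 0.8044 | 0.6357 | 0.7102 |
| PicAxe OCR (Paddle) | 0.9695 | 0.7151 | 0.8231 | 0.6862 | 0.6662 | 0.6760 |
| PicAxe YOLO | 0.9082 | 0.8724 | 0.8899 | 0.8193 | 0.7813 | 0.7998 |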
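
As a rough check on how the scores relate to the Table 2 counts, the sketch below recomputes precision, recall, and F1 for the PicAxe YOLO row on the Biofilm corpus, reading that row as TPc = 2136, TPi = 11, FPa = 75, FPu = 142, FN = 97. This page does not state the formulas, so the definitions here are assumptions: both kinds of true positive count as TP, both kinds of false positive count as FP, and the recall denominator is TP + FP + FN, which is the choice that reproduces the reported Biofilm values (the textbook TP + FN denominator gives a different number). The `Counts` class and `scores` function are illustrative names, not part of PicAxe.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    tp_complete: int
    tp_incomplete: int
    fp_acceptable: int
    fp_unacceptable: int
    fn: int


def scores(c: Counts) -> tuple[float, float, float]:
    """Return (precision, recall, F1) under the assumptions stated above."""
    tp = c.tp_complete + c.tp_incomplete
    fp = c.fp_acceptable + c.fp_unacceptable
    if tp == 0:
        return 0.0, 0.0, 0.0
    precision = tp / (tp + fp)
    # Assumption: denominator inferred from the reported numbers,
    # not taken from a formula given in the source.
    recall = tp / (tp + fp + c.fn)
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# PicAxe YOLO on the Biofilm corpus, counts as read from Table 2.
yolo_biofilm = Counts(2136, 11, 75, 142, 97)
p, r, f1 = scores(yolo_biofilm)
print(f"precision={p:.4f} recall={r:.4f} f1={f1:.4f}")
```

With these inputs the script prints `precision=0.9082 recall=0.8724 f1=0.8899`, matching the Biofilm entries for PicAxe YOLO in Table 3.
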
Figure 3

PicAxe extraction result types: PicAxe can produce true positive results (complete and incomplete), false positive results (acceptable and unacceptable), and false negative results.

DOI: https://doi.org/10.5334/jors.574 | Journal eISSN: 2049-9647
Language: English
Submitted on: Apr 28, 2025
Accepted on: Dec 1, 2025
Published on: Dec 16, 2025
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2025 Anna C. Guerrero, Krishna Kamath, Qilin Zhou, Bruno Felalaga, Julia Damerow, Aaron R. Dinner, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.