
Figure 1
Current free automated figure extractors perform poorly on structurally heterogeneous corpora: figure extraction results from PyMuPDF [7]. Case 1: individual figures are not embedded within the PDF structure; Case 2: whole figures are embedded within the PDF structure; Case 3: incomplete parts of single figures are embedded within the PDF structure.

Figure 2
PicAxe visual overview.
Table 1
Total images extracted: corpus-level extraction performance, evaluated as the total number of images extracted (# images) and its percentage change relative to the total number of images extracted manually (% change).
| | BIOFILM | | ANTHROPOCENE | |
|---|---|---|---|---|
| | # IMAGES | % CHANGE | # IMAGES | % CHANGE |
| Manual | 2748 | – | 842 | – |
| PicAxe OCR (pytesseract) | 3126 | +13.76% | 962 | +14.25% |
| PicAxe OCR (Paddle) | 1773 | –35.48% | 1273 | +51.19% |
| PicAxe YOLO | 2364 | –13.97% | 1364 | +61.64% |
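
The % change column can be recomputed from the # images column. The following is a minimal sketch, assuming the change is taken relative to the manual count; `percent_change` is an illustrative helper and not part of PicAxe. It is checked here against the Biofilm column only.

```python
def percent_change(extracted: int, manual: int) -> float:
    """Signed percentage difference between an automated count and the manual count."""
    return (extracted - manual) / manual * 100.0


# Biofilm counts taken from Table 1.
manual_biofilm = 2748
biofilm_counts = {
    "PicAxe OCR (pytesseract)": 3126,
    "PicAxe OCR (Paddle)": 1773,
    "PicAxe YOLO": 2364,
}

for name, count in biofilm_counts.items():
    print(f"{name}: {percent_change(count, manual_biofilm):+.2f}%")
# PicAxe OCR (pytesseract): +13.76%
# PicAxe OCR (Paddle): -35.48%
# PicAxe YOLO: -13.97%
```

A positive value means the extractor returned more images than manual extraction, which may reflect false positives rather than better coverage.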
Table 2
Positives and negatives: image-level extraction performance, evaluated by the number of images extracted in each result category compared to manual extraction. Categories are complete true positives (TPc), incomplete true positives (TPi), acceptable false positives (FPa), unacceptable false positives (FPu), and false negatives (FN).
| | TPc | TPi | FPa | FPu | FN |
|---|---|---|---|---|---|
| BIOFILM | | | | | |
| Manual | 2748 | – | – | – | – |
| PicAxe OCR (pytesseract) | 1626 | 19 | 111 | 1370 | 561 |
| PicAxe OCR (Paddle) | 1567 | 152 | 26 | 28 | 631 |
| PicAxe YOLO | 2136 | 11 | 75 | 142 | 97 |
| ANTHROPOCENE | | | | | |
| Manual | 842 | – | – | – | – |
| PicAxe OCR (pytesseract) | 697 | 8 | 3 | 254 | 255 |
| PicAxe OCR (Paddle) | 794 | 142 | 1 | 427 | 41 |
| PicAxe YOLO | 1014 | 29 | 200 | 30 | 62 |
Table 3
Precision, recall, F1: corpus-level extraction performance, computed from the total numbers of complete true positives, incomplete true positives, acceptable false positives, unacceptable false positives, and false negatives.
| | BIOFILM | | | ANTHROPOCENE | | |
|---|---|---|---|---|---|---|
| | PRECISION | RECALL | F1 | PRECISION | RECALL | F1 |
| PicAxe OCR (pytesseract) | 0.5262 | 0.4462 | 0.4829 | 0.8044 | 0.6357 | 0.7102 |
| PicAxe OCR (Paddle) | 0.9695 | 0.7151 | 0.8231 | 0.6862 | 0.6662 | 0.6760 |
| PicAxe YOLO | 0.9082 | 0.8724 | 0.8899 | 0.8193 | 0.7813 | 0.7998 |
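
The scores above can be related to the counts in Table 2. The formulas are not stated in this section, so the sketch below is an inference from the numbers: it assumes both true positive types count as true positives, both false positive types count as false positives, and recall is TP / (TP + FP + FN). That recall denominator differs from the conventional TP / (TP + FN). The `Counts` class and `scores` function are illustrative names, not part of PicAxe, and the sketch is checked here against the Biofilm rows only.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    tp_c: int  # complete true positives (TPc)
    tp_i: int  # incomplete true positives (TPi)
    fp_a: int  # acceptable false positives (FPa)
    fp_u: int  # unacceptable false positives (FPu)
    fn: int    # false negatives (FN)


def scores(c: Counts) -> tuple[float, float, float]:
    """Return (precision, recall, F1) under the assumptions stated in the lead-in."""
    tp = c.tp_c + c.tp_i
    fp = c.fp_a + c.fp_u
    precision = tp / (tp + fp)
    # Assumption inferred from the tables: false positives are included in
    # the recall denominator, unlike the conventional tp / (tp + fn).
    recall = tp / (tp + fp + c.fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Biofilm counts taken from Table 2.
biofilm = {
    "PicAxe OCR (pytesseract)": Counts(1626, 19, 111, 1370, 561),
    "PicAxe OCR (Paddle)": Counts(1567, 152, 26, 28, 631),
    "PicAxe YOLO": Counts(2136, 11, 75, 142, 97),
}

for name, c in biofilm.items():
    p, r, f1 = scores(c)
    print(f"{name}: P={p:.4f} R={r:.4f} F1={f1:.4f}")
# PicAxe OCR (pytesseract): P=0.5262 R=0.4462 F1=0.4829
# PicAxe OCR (Paddle): P=0.9695 R=0.7151 F1=0.8231
# PicAxe YOLO: P=0.9082 R=0.8724 F1=0.8899
```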

Figure 3
PicAxe extraction result types: PicAxe can produce true positive results (complete and incomplete), false positive results (acceptable and unacceptable), and false negative results.
