PicAxe: Extracting Figures from Structurally and Syntactically Heterogeneous Corpora of PDF Files

Anna C. Guerrero; Krishna Kamath; Qilin Zhou; Bruno Felalaga; Julia Damerow; Aaron R. Dinner

doi:10.5334/jors.574

Abstract

PicAxe is open-source Python software that researchers can use to extract figures from corpora of PDF files that contain text and images. It is designed for corpora that include both scanned and “born-digital” PDF files (structurally heterogeneous) of documents from different cultures and time periods (syntactically heterogeneous). In this paper we describe PicAxe's functionality and demonstrate it on two corpora. The first corpus contains scanned documents that represent the development of the “microbial biofilm” concept from 1929 to 1974. The second corpus contains “born-digital” documents from the academic journal Anthropocene from 2014 to 2023.
