PicAxe: Extracting Figures from Structurally and Syntactically Heterogeneous Corpora of PDF Files

Open Access | Dec 2025

References

  1. Lee B, Seo MK, Kim D, Shin I, Schich M, Jeong H, Han SK. Dissecting Landscape Art History with Information Theory. Proceedings of the National Academy of Sciences. 2020;117(43):26580–26590. DOI: 10.1073/pnas.2011927117
  2. Miton H, Sperber D, Hernik M. A Forward Bias in Human Profile-Oriented Portraits. Cognitive Science. 2020;44(6):e12866. DOI: 10.1111/cogs.12866
  3. Moreira D, Cardenuto JP, Shao R, Baireddy S, Cozzolino D, Gragnaniello D, Abd-Almageed W, Bestagini P, Tubaro S, Rocha A, Scheirer W, Verdoliva L, Delp E. SILA: A System for Scientific Image Analysis. Scientific Reports. 2022;12(1):18306. DOI: 10.1038/s41598-022-21535-3
  4. Chen Y, Sherren K, Smit M, Lee KY. Using Social Media Images as Data in Social Science Research. New Media & Society. 2023;25(4):849–871. DOI: 10.1177/14614448211038761
  5. Soh LK, Lorang L, Pack C, Liu Y. Applying Image Analysis and Machine Learning to Historical Newspaper Collections. The American Historical Review. 2023;128(3):1382–1389. DOI: 10.1093/ahr/rhad369
  6. Valente J, Antonio J, Mora C, Jardim S. Developments in Image Processing Using Deep Learning and Reinforcement Learning. Journal of Imaging. 2023;9(10):207. DOI: 10.3390/jimaging9100207
  7. Artifex. PyMuPDF; 2024. URL: https://github.com/pymupdf/PyMuPDF.
  8. Binmakhashen GM, Mahmoud SA. Document Layout Analysis: A Comprehensive Survey. ACM Computing Surveys. 2019;52(6):109. DOI: 10.1145/3355610
  9. Yu F, Huang J, Luo Z, Zhang L, Lu W. An Effective Method for Figures and Tables Detection in Academic Literature. Information Processing & Management. 2023;60(3). DOI: 10.1016/j.ipm.2023.103286
  10. Subramani N, Matton A, Greaves M, Lam A. A Survey of Deep Learning Approaches for OCR and Document Understanding. ArXiv; 2020. DOI: 10.48550/arXiv.2011.13534
  11. Ultralytics. YOLOv8; 2024. URL: https://github.com/ultralytics/ultralytics/blob/main/docs/en/models/yolov8.md.
  12. Community P. PaddleOCR; 2024. URL: https://github.com/PaddlePaddle/PaddleOCR.
  13. Guerrero AC, Kamath K, Zhou Q, Felalaga B, Dinner AR. PicAxe; 2025. DOI: 10.5281/zenodo.14873182
  14. Damerow J, Peirson BRE, Laubichler MD. The Giles Ecosystem – Storage, Text Extraction, and OCR of Documents. Journal of Open Research Software. 2017;5(1):26. DOI: 10.5334/jors.164
  15. Otsu N. A Threshold Selection Method from Gray-Level Histograms. IEEE Transactions on Systems, Man, and Cybernetics. 1979;9(1):62–66. DOI: 10.1109/TSMC.1979.4310076
  16. Vincent O, Folorunso O. A Descriptive Algorithm for Sobel Image Edge Detection. Informing Science + IT Education Conference; 2009. DOI: 10.28945/3351
  17. Kahu SY, Ingram WA, Fox EA, Wu J. ScanBank: A Benchmark Dataset for Figure Extraction from Scanned Electronic Theses and Dissertations. ACM/IEEE Joint Conference on Digital Libraries; 2021. pp. 180–191. DOI: 10.1109/JCDL52503.2021.00030
  18. Peter. Find the Images Dataset; 2022. URL: https://universe.roboflow.com/peter-j1jzx/findthe-images.
  19. Da C, Luo C, Zheng Q, Yao C. Vision Grid Transformer for Document Layout Analysis. ArXiv; 2023. DOI: 10.1109/ICCV51070.2023.01783
  20. Rezanezhad V, Baierer K, Gerber M, Labusch K, Neudecker C. Document Layout Analysis with Deep Learning and Heuristics. Proceedings of the 7th International Workshop on Historical Document Imaging and Processing. 2023;73–78. DOI: 10.1145/3604951.3605513
  21. Taraday V, Baskin C. Enhanced Meta Label Correction for Coping with Label Corruption. ArXiv; 2023. DOI: 10.1109/ICCV51070.2023.01493
  22. Hudson L. pyzbar; 2022. URL: https://github.com/NaturalHistoryMuseum/pyzbar.
  23. Belval E. pdf2image; 2024. URL: https://github.com/Belval/pdf2image.
  24. Shen Z, Zhang R, Dell M, Lee BCG, Carlson J, Li W. LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis. Document Analysis and Recognition – ICDAR. 2021;12821:131–146. DOI: 10.1007/978-3-030-86549-8_9
  25. Hoffstaetter S, Lee M. pytesseract; 2024. URL: https://github.com/h/pytesseract.
  26. Khow ZJ, Tan YF, Karim HA, Rashid HAA. Improved YOLOv8 Model for a Comprehensive Approach to Object Detection and Distance Estimation. IEEE Access. 2024;12:63754–63767. DOI: 10.1109/ACCESS.2024.3396224
  27. Groleau A, Chee KW, Larson S, Maini S, Boarman J. ShabbyPages: A Reproducible Document Denoising and Binarization Dataset. ArXiv; 2023. DOI: 10.48550/arXiv.2303.09339
DOI: https://doi.org/10.5334/jors.574 | Journal eISSN: 2049-9647
Language: English
Submitted on: Apr 28, 2025
Accepted on: Dec 1, 2025
Published on: Dec 16, 2025
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2025 Anna C. Guerrero, Krishna Kamath, Qilin Zhou, Bruno Felalaga, Julia Damerow, Aaron R. Dinner, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.