
Detailed Implementation of a Reproducible Machine Learning-Enabled Workflow
DOI: https://doi.org/10.5334/dsj-2024-023 | Journal eISSN: 1683-1470
Language: English
Page range: 23 - 23
Submitted on: Dec 9, 2023
Accepted on: Apr 8, 2024
Published on: Apr 29, 2024
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year
© 2024 Kenneth E. Schackart III, Heidi J. Imker, Charles E. Cook, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.