NØMAD: Lightweight HPC Monitoring and Diagnostics with Machine Learning-Based Failure Prediction

Open Access | Mar 2026

References

  1. Prometheus Authors. Prometheus: From metrics to insight; 2012. https://prometheus.io/.
  2. Grafana Labs. Grafana: The open observability platform; 2014. https://grafana.com/.
  3. Galstad E. Nagios: The industry standard in IT infrastructure monitoring. https://www.nagios.org/.
  4. Evans RT, Browne JC, Barth WL. Comprehensive resource use monitoring for HPC systems with TACC Stats. In: Proceedings of the First International Workshop on HPC User Support Tools. IEEE; 2014. pp. 13–21. DOI: 10.1109/HUST.2014.7
  5. Palmer JT, Gallo SM, Furlani TR, Jones MD, DeLeon RL, White JP, Simakov N, Patra AK, Sperhac J, Yearke T, Rathsam R, Inber M, Guillen O, Cornelius CD. Open XDMoD: A tool for the comprehensive management of high-performance computing resources. Computing in Science & Engineering. 2015;17(4):52–62. DOI: 10.1109/MCSE.2015.68
  6. Agelastos A, Allan B, Brandt J, Cassella P, Enos J, Fullop J, Gentile A, Monk S, Naksinehaboon N, Ogden J, Rajan M, Showerman M, Stevenson J, Taerat N, Tucker T. The Lightweight Distributed Metric Service: A scalable infrastructure for continuous monitoring of large scale computing systems and applications. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE; 2014. pp. 154–165. DOI: 10.1109/SC.2014.18
  7. Tuncer O, Ates E, Zhang Y, Turk A, Brandt J, Leung VJ, Egele M, Coskun AK. Diagnosing performance variations in HPC applications using machine learning. In: High Performance Computing, Lecture Notes in Computer Science, vol. 10266. Springer; 2017. pp. 355–373. DOI: 10.1007/978-3-319-58667-0_19
  8. Vilhena DA, Antonelli A. A network approach for identifying and delimiting biogeographical regions. Nature Communications. 2015;6:6848. DOI: 10.1038/ncomms7848
  9. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research. 2011;12:2825–2830. http://jmlr.org/papers/v12/pedregosa11a.html.
  10. Yoo AB, Jette MA, Grondona M. SLURM: Simple Linux Utility for Resource Management. In: Job Scheduling Strategies for Parallel Processing, Lecture Notes in Computer Science, vol. 2862. Springer; 2003. pp. 44–60. DOI: 10.1007/10968987_3
DOI: https://doi.org/10.5334/jors.686 | Journal eISSN: 2049-9647
Language: English
Submitted on: Jan 22, 2026 | Accepted on: Feb 24, 2026 | Published on: Mar 12, 2026
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2026 João Filipe Riva Tonini, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.