NØMAD: Lightweight HPC Monitoring and Diagnostics with Machine Learning-Based Failure Prediction

João Filipe Riva Tonini

doi:10.5334/jors.686

Abstract

NØMAD (NOde Monitoring And Diagnostics) is a lightweight monitoring and predictive analytics tool designed for computing infrastructure that requires minimal deployment overhead. At its core, metric collectors scheduled via systemd timers gather system metrics—disk, CPU, memory, I/O, and GPU utilization—from standard Linux tools, storing everything in a single SQLite database. When SLURM is available, these collectors extend to capture job-level analytics from scheduler commands and per-process I/O statistics, enabling deeper insight into workload behavior. The system employs machine learning (ML) models to predict job failures before they occur, while a data readiness estimator helps administrators determine when sufficient historical data has been collected for reliable predictions. Beyond prediction, diagnostic tools provide targeted analysis of network performance, storage health, and node-level bottlenecks. A key innovation is modeling jobs as nodes in a similarity network, where edges connect jobs with comparable resource usage patterns; because jobs with similar characteristics tend to share similar outcomes, the network structure reveals failure-prone regions in the feature space. The tool includes a real-time web dashboard for visualization and supports alerts via email, Slack, or webhooks. By requiring no external databases or complex infrastructure, NØMAD is particularly well-suited for small-to-medium high-performance computing (HPC) centers and research groups seeking sophisticated monitoring without enterprise-scale overhead.
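
The similarity-network idea can be illustrated with a minimal sketch. It is not NØMAD's implementation: the abstract does not say which similarity measure or edge rule the tool uses, so the cosine similarity, the 0.95 threshold, the three-value feature vectors, and every function and job name below are assumptions made for illustration. Jobs become nodes, an edge joins two jobs whose resource-usage vectors are similar enough, and the share of failed neighbours around each job gives a rough signal of failure-prone regions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_similarity_network(features, threshold=0.95):
    """Map each job id to the set of job ids it shares an edge with.

    The threshold is an arbitrary illustrative value, not one taken
    from the paper.
    """
    job_ids = list(features)
    neighbors = {job_id: set() for job_id in job_ids}
    for i, a in enumerate(job_ids):
        for b in job_ids[i + 1:]:
            if cosine_similarity(features[a], features[b]) >= threshold:
                neighbors[a].add(b)
                neighbors[b].add(a)
    return neighbors


def neighborhood_failure_rate(neighbors, failed):
    """Fraction of each job's neighbours that failed (None if isolated)."""
    rates = {}
    for job_id, adjacent in neighbors.items():
        if adjacent:
            rates[job_id] = sum(failed[n] for n in adjacent) / len(adjacent)
        else:
            rates[job_id] = None
    return rates


# Hypothetical jobs described by (CPU, memory, I/O wait) fractions.
features = {
    "job_a": (0.90, 0.20, 0.05),
    "job_b": (0.88, 0.22, 0.06),
    "job_c": (0.10, 0.95, 0.40),
    "job_d": (0.12, 0.90, 0.45),
    "job_e": (0.11, 0.93, 0.42),
}
# Hypothetical outcomes: True means the job failed.
failed = {
    "job_a": False,
    "job_b": False,
    "job_c": True,
    "job_d": True,
    "job_e": False,
}

network = build_similarity_network(features)
rates = neighborhood_failure_rate(network, failed)
```

In this toy data the CPU-bound jobs form one cluster with no failed neighbours, while the memory-heavy jobs form another in which `job_e`, although it completed, is surrounded entirely by failed jobs. That is the kind of region the abstract describes as failure-prone.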

