NØMAD: Lightweight HPC Monitoring and Diagnostics with Machine Learning-Based Failure Prediction

Open Access | Mar 2026

Figures & Tables

Figure 1

NØMAD system architecture showing the data flow from collectors through the analysis engines to the alert dispatcher and web dashboard. The monitoring engine handles threshold-based alerts while the prediction engine uses ML models for proactive failure detection.

Figure 2

NØMAD web dashboard in light theme showing real-time cluster status. The main view displays three partitions (compute, highmem, gpu) with node health rings indicating CPU utilization. The sidebar shows detailed statistics for the selected node including job counts, resource utilization, and top users.

Figure 3

Usage analytics views. Top: Resource usage summary showing CPU-hours, GPU-hours, and breakdown by group and user. Bottom: Activity heatmap displaying job submissions by day and hour, revealing usage patterns.

Figure 4

Interactive computing sessions dashboard showing active RStudio and Jupyter sessions with memory usage, session age, and status.

Figure 5

Infrastructure monitoring views. Top: Workstation overview showing departmental machines (Biology, Chemistry, Physics, Math/CS) with status, CPU load, memory, disk usage, and logged-in users. Bottom: Storage overview displaying NFS servers with capacity, usage, ZFS pool health, and connected clients.

Figure 6

Job similarity network visualization with feature variance analysis. The 3D view shows jobs colored by outcome (green = completed, red = failed, orange = timeout, purple = OOM). The left panel displays feature variance statistics sorted by coefficient of variation (CV); high CV values for exit_signal and failure_reason (exceeding 200%) reflect normal cluster operation where most jobs succeed. The legend shows job outcome distribution across 1,000 jobs.
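The link between a low failure rate and a high CV can be illustrated with a small sketch. It assumes, purely for illustration, that a failure-related feature is encoded as a 0/1 indicator; the encoding, the 10% failure rate, and the helper name `coefficient_of_variation` are hypothetical and are not taken from NØMAD itself. For such an indicator with rate p, the CV equals sqrt((1 - p) / p), so it exceeds 200% whenever fewer than 20% of jobs carry the flag.

```python
import math
import statistics


def coefficient_of_variation(values):
    """Return the population CV (standard deviation / mean) as a percentage."""
    mean = statistics.fmean(values)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return 100.0 * statistics.pstdev(values) / mean


# Hypothetical data: 1,000 jobs, of which 100 carry a failure flag (1 = failed).
failure_flag = [1] * 100 + [0] * 900

cv_percent = coefficient_of_variation(failure_flag)

# Closed form for a 0/1 indicator with rate p: CV = sqrt((1 - p) / p).
p = sum(failure_flag) / len(failure_flag)
closed_form_percent = 100.0 * math.sqrt((1 - p) / p)

print(f"empirical CV: {cv_percent:.1f}%  closed form: {closed_form_percent:.1f}%")
```

Under these assumptions the rarer the flagged outcome, the larger the CV, which is consistent with the caption's reading that high CV on failure-related features reflects a cluster where most jobs succeed.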

Figure 7

Educational analytics outputs. Top left (nomad edu explain 1104): Job explanation with proficiency scores and recommendations. Top right (nomad edu trajectory alice): User trajectory tracking improvement over 173 jobs. Bottom (nomad edu report cs101): Group report for a course section with per-student breakdown.

DOI: https://doi.org/10.5334/jors.686 | Journal eISSN: 2049-9647
Language: English
Submitted on: Jan 22, 2026 | Accepted on: Feb 24, 2026 | Published on: Mar 12, 2026
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2026 João Filipe Riva Tonini, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.