High Performance Computing
Quick Literature Review
To address monitoring in high performance computing, researchers from the University of Modena and Reggio Emilia in Italy proposed in 2012 a cluster-based real-time event monitoring system (Andreolini, Colajanni, & Pietri, 2012). For those of us with an HPC background, real-time performance analytics and redundant compute clusters seem like obvious requirements, but as Andreolini et al. (2012) found, existing monitoring solutions did not provide them. The researchers looked at various Linux monitoring solutions and found that none were resilient to failure and that manual intervention was needed to recover from system failures or overloads (Andreolini et al., 2012). Anyone who has worked in monitoring has run into this problem: node polling times eventually coalesce to a single point in time, creating resource spikes that either freeze the box, causing irregularities in the polling data until the box recovers, or cause a failure that forces manual intervention, often in the form of a restart.
In 2011 Rheinheimer reported on the state of the Zenoss-based monitoring solution that Los Alamos National Laboratory (LANL) had developed to address high performance computing data collection. According to Martin and Johnson (2011) from LANL, none of the standard monitoring solution architectures could keep up with HPC data collection. LANL used a modified version of Zenoss 3.0 and a tiered architecture to build a solution that could handle the performance needs of the HPC environment (Rheinheimer, 2011; Martin & Johnson, 2011).
My testing is currently being done on a solution that uses version 5 of Zenoss, chosen for its implementation of application-layer virtualization through Docker, to address both the scalability and the resiliency problems in high performance computing monitoring. LANL had to increase node count to decrease the percentage of failures due to oversubscription during data point collection on a per-collector basis. The attachment of devices to a specific collector meant that cluster nodes without a workload still had to be accommodated, and each node's attachment to a specific collector meant that there was a single point of failure for collection. As the collector count grows to accommodate a growing HPC cluster, the larger footprint increases the likelihood of a failure. A cluster pool with N+1 virtualized hosts allows for resiliency at the hardware layer, the hypervisor layer, and the application process layer, decreasing the chances of a failure. Attaching HPC nodes to a pooled set of collectors instead of a single box allows one to plan for the highest usage of all the HPC nodes attached to the pool rather than the highest usage of the smaller number of HPC nodes attached to a single device. This architecture also decreases the number of collector nodes necessary to achieve N+1 redundancy. The entire pool size is decreased by application containerization's ability to realign services throughout the pool, allowing N to be significantly smaller than in other solutions while still retaining high polling frequencies. The allocation of HPC devices to collector pools is a function of the job scheduler, which decreases the likelihood of overcommitting any specific pool in the event that there is a system failure and the pool has to degrade to N (rather than N+1) status.
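The sizing argument above can be made concrete with a small sketch. It is only an illustration and not the Zenoss implementation: the collector capacity, the per-node peak loads, the grouping, and the function names `pinned_collectors` and `pooled_collectors` are all hypothetical, and it assumes that peak loads simply add up and that every pinned group keeps its own spare collector.

```python
import math

# All numbers below are hypothetical and only illustrate the sizing argument.
COLLECTOR_CAPACITY = 50  # assumed peak load one collector can sustain (arbitrary units)

# Assumed peak collection load per HPC node, grouped by the collector each
# node would be pinned to in a one-collector-per-group design.
NODE_GROUPS = [
    [30, 10, 5],
    [40, 20],
    [15, 5, 5, 5],
]


def pinned_collectors(groups, capacity, spares_per_group=1):
    """Collectors needed when every group is sized and made redundant on its own."""
    return sum(math.ceil(sum(g) / capacity) + spares_per_group for g in groups)


def pooled_collectors(groups, capacity, spares=1):
    """Collectors needed when all nodes share one pool with N+1 redundancy."""
    total_load = sum(sum(g) for g in groups)
    return math.ceil(total_load / capacity) + spares


pinned = pinned_collectors(NODE_GROUPS, COLLECTOR_CAPACITY)
pooled = pooled_collectors(NODE_GROUPS, COLLECTOR_CAPACITY)
print(f"pinned: {pinned} collectors, pooled: {pooled} collectors")
```

With these made-up inputs the pinned design needs 7 collectors and the pooled design needs 4, because the pool rounds up once over the combined load and carries a single spare instead of one spare per group.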
HPC data collection has been achieved without overcommitting any single pool and without any loss of data, using a 5-second polling interval for all Linux SNMP-based collection metrics. SSH-based metrics have been collected consistently every 30 seconds. Networking and infrastructure monitoring are also reliable at a 5-second SNMP polling interval. For a 200-node HPC farm running semiconductor regressions, the ratio of necessary pool hosts to HPC nodes was 4 to 57. Running 5 to 57 allowed for redundancy.
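As a quick worked check of what those ratios mean, the short sketch below converts them to pool hosts per HPC node and computes the relative cost of the one extra host. The constant and function names are my own and purely illustrative; only the figures 4, 5, and 57 come from the test results above.

```python
from fractions import Fraction

RATIO_NODES = 57      # HPC nodes in the reported ratio
HOSTS_MINIMUM = 4     # pool hosts needed to keep up (N)
HOSTS_REDUNDANT = 5   # pool hosts when one spare is added (N+1)


def hosts_per_node(hosts, nodes=RATIO_NODES):
    """Exact fraction of a pool host required per HPC node."""
    return Fraction(hosts, nodes)


min_ratio = hosts_per_node(HOSTS_MINIMUM)
red_ratio = hosts_per_node(HOSTS_REDUNDANT)
# Extra pool hosts needed for redundancy, relative to the minimum pool size.
overhead = Fraction(HOSTS_REDUNDANT - HOSTS_MINIMUM, HOSTS_MINIMUM)

print(f"minimum:   {float(min_ratio):.1%} pool hosts per HPC node")
print(f"redundant: {float(red_ratio):.1%} pool hosts per HPC node")
print(f"overhead:  {float(overhead):.0%} more pool hosts for N+1")
```

In other words, the pool amounts to roughly 7.0% of the HPC node count at N and 8.8% at N+1, and redundancy costs 25% more pool hosts than the bare minimum.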
This is just preliminary testing on a small cluster; access to a larger compute farm and a different workload would allow for stronger results. The approach could potentially be applied to compute farms that want to analyze workload performance on their machines, by increasing the level of data available to the HPC engineers. While there is much work yet to be done, I hope this will spark further research into HPC monitoring.