NPL is looking for a Linux system administrator for its small HPC cluster, Minerva.
Minerva is an HPC cluster running Rocky 8 with 8 CPU nodes (2x28 cores each, Intel Ice Lake chipset) and 2 GPU nodes (4x NVIDIA A100 80 GB cards and 1 TB RAM each) with an InfiniBand interconnect. It was recently upgraded by our external partners, Cambridge Research Computing Ltd, who also support the system administration remotely as second-line experts.
As the NPL-side system administrator for the HPC cluster, your duties will be the following:
- Install scientific software (MATLAB, COMSOL and other software of NPL-wide interest) as modules, as updates are released and validated by CIO.
- Work with the NPL user community to regularly evaluate its resource needs and the evolution of the cluster, and to handle software installation and update requests.
- Answer NPL user community questions about the HPC cluster, and more generally keep the user base informed through dedicated internal communication channels (Yammer/Teams/Mail_Minerva). Decide whether to maintain Mail_Minerva, knowing that neither Yammer nor Teams can guarantee that people will see or read your posts.
- Monitor the cluster regularly and troubleshoot any unresponsive nodes.
- Work closely with Cambridge Research Computing Ltd (via the JIRA ticketing system) on unresolved problems affecting the cluster, run tests and schedule regular maintenance. Report back to the user community about any downtime or event affecting the cluster (scheduled or not).
- Work with CIO to purchase any equipment to upgrade or extend the cluster.
- Work with other members of the Scientific Computing Team to organise internal communication and training, and to maintain synergy with other NPL computing resources (archiving system).
- Interact with the wider UK HPC community: join the HPC SIG as the main NPL contact and ask to enrol in its sysadmin mentorship scheme.
- Update the cluster documentation: the current documentation was written before the upgrade and was valid for the previous operating system (CentOS 7) and its associated software stack. The user documentation will need to be rewritten for the new software stack, including the new access application, Open OnDemand.