HPC Operations Engineer
Hybrid · San Francisco, California, United States or Seattle, Washington, United States
Job Summary
The HPC Operations Engineer remotely deploys and configures large-scale HPC clusters for AI workloads, installs and configures the OS, firmware, software, and networking on those clusters, and troubleshoots cluster issues in collaboration with on-site teams. The engineer will define requirements to improve stability, simplicity, and operational efficiency; contribute to SOPs; provide updates to project leads; mentor junior engineers; and travel to North American data centers as needed. The role requires deep knowledge of HPC/AI architecture, Linux proficiency, experience with InfiniBand and high-speed networks, and familiarity with job scheduling systems (SLURM/Kubernetes); experience with PyTorch/TensorFlow and containerization (Docker/Kubernetes) is a plus. This is a hybrid role with four designated in-office days per week and one remote workday. A Bachelor's degree in a related field (or equivalent work experience) and 5+ years of relevant experience are expected.
Required Qualifications
- 5+ years of experience deploying and configuring HPC clusters for AI workloads
- Bachelor's degree in EE, CS, Physics, Mathematics, or equivalent work experience
- Strong understanding of HPC/AI architecture, operating systems, firmware, software, and networking
- Experience with SFP+ fiber, InfiniBand, and 100 GbE network fabrics
- Proficiency with Linux-based compute nodes, firmware updates, and driver installation
- Experience with SLURM, Kubernetes, or other job scheduling systems