Lambda · 1 week ago

HPC Operations Engineer

Hybrid · San Francisco, California, United States or Seattle, Washington, United States

Type: Full Time
Level: Mid Level
Education: Bachelor's Degree
Company size: Unknown

Job Summary

The HPC Operations Engineer remotely deploys and configures large-scale HPC clusters for AI workloads, installs and configures the OS, firmware, software, and networking on those clusters, and troubleshoots cluster issues in collaboration with on-site teams. The engineer also defines requirements to improve stability, simplicity, and operational efficiency; contributes to SOPs; provides updates to project leads; mentors junior engineers; and travels to North American data centers as needed. The role requires deep knowledge of HPC/AI architecture, Linux proficiency, experience with InfiniBand and other high-speed networks, and familiarity with job scheduling systems (SLURM or Kubernetes); experience with PyTorch or TensorFlow and with containerization (Docker, Kubernetes) is a plus. The position is office-hybrid, with four designated in-office days per week and one remote workday. A Bachelor's degree in a related field and 5+ years of relevant experience are expected.

Required Qualifications

  • 5+ years of experience deploying and configuring HPC clusters for AI workloads
  • Bachelor's degree in EE, CS, Physics, Mathematics, or equivalent work experience
  • Strong understanding of HPC/AI architecture, operating systems, firmware, software, and networking
  • Experience with SFP+ fiber, InfiniBand, and 100 GbE network fabrics
  • Proficiency with Linux-based compute nodes, firmware updates, and driver installation
  • Experience with SLURM, Kubernetes, or other job scheduling systems