Lambda · 1 week ago

HPC Operations Engineer

Lambda

Hybrid · San Francisco, California, United States or Seattle, Washington, United States

Type

Full Time

Level

Mid Level

Education

Bachelor's Degree

Company size

Unknown

Job Summary

The HPC Operations Engineer remotely deploys and configures large-scale HPC clusters for AI workloads, installs and configures operating systems, firmware, software, and networking on those clusters, and troubleshoots cluster issues in collaboration with on-site teams. The engineer will define requirements to improve stability, simplicity, and operational efficiency; contribute to SOPs; provide updates to project leads; mentor junior engineers; and travel to North American data centers as needed. The role requires deep knowledge of HPC/AI architecture, Linux proficiency, experience with InfiniBand and high-speed networks, and familiarity with job scheduling systems (SLURM/Kubernetes); experience with PyTorch/TensorFlow and containerization (Docker/Kubernetes) is a plus. This is a hybrid role with four designated in-office days per week and one remote workday. A Bachelor's degree in a related field and 5+ years of relevant experience are expected.

Required Qualifications

5+ years of experience deploying and configuring HPC clusters for AI workloads
Bachelor's degree in EE, CS, Physics, Mathematics, or equivalent work experience
Strong understanding of HPC/AI architecture, operating systems, firmware, software, and networking
Experience with SFP+ fiber, InfiniBand, and 100 GbE network fabrics
Proficiency with Linux-based compute nodes, firmware updates, and driver installation
Experience with SLURM, Kubernetes, or other job scheduling systems
