Operations Engineer, HPC Networking
Remote · United States
Job Summary
Operate InfiniBand and Ethernet fabrics in production: bring up new fabrics alongside DC ops, monitor fabric health, diagnose issues (link flaps, congestion, NCCL stalls, firmware bugs at scale), and perform maintenance and upgrades.
Responsibilities: monitoring the health and performance of InfiniBand and Ethernet fabrics (switches, HCAs, transceivers, links); investigating and resolving connectivity problems, congestion, and performance regressions; supporting fabric bring-up with DC ops and customer-facing teams; improving tooling and runbooks to speed incident resolution.
Skills: operating InfiniBand fabrics in production (subnet manager, routing, partitioning, monitoring); debugging the full stack (cables, transceivers, switch firmware, HCAs, drivers, NCCL); bringing up new fabrics from validation; scripting in Bash, Python, or Go.
Nice to have: Ethernet RoCE, Spectrum-X, or large-scale GPU cluster networking.
Personal traits: detail-oriented, calm under fire, proactive in root-cause analysis, and committed to reducing outages.