Fal · 1 month ago

Operations Engineer, HPC Networking

Remote · United States

Type: Full Time
Level: Mid Level
Education: Not Specified
Company size: Enterprise

Job Summary

Operate InfiniBand and Ethernet fabrics in production: bring up new fabrics alongside DC ops, monitor fabric health, diagnose issues (link flaps, congestion, NCCL stalls, firmware bugs at scale), and perform maintenance and upgrades.

Responsibilities: monitor the health and performance of InfiniBand and Ethernet fabrics (switches, HCAs, transceivers, links); investigate and resolve connectivity, congestion, and performance regressions; support fabric bring-up with DC ops and customer-facing teams; improve tooling and runbooks to speed incident resolution.

Skills: operating InfiniBand fabrics in production (subnet manager, routing, partitioning, monitoring); debugging the full stack (cables, transceivers, switch firmware, HCAs, drivers, NCCL); bringing up new fabrics, starting from validation; scripting in Bash, Python, or Go. Nice to have: Ethernet RoCE, Spectrum-X, or large-scale GPU cluster networking.

Personal traits: detail-oriented, calm under fire, proactive in root-cause analysis, and committed to reducing outages.
