Fal · 1 month ago

Operations Engineer, Fleet Reliability

On-site · Remote, Oregon, United States

Type: Full Time
Level: Mid Level
Education: Not Specified
Company size: Enterprise

Job Summary

Provision, validate, and triage GPU nodes across B300, H200, and H100 clusters. Troubleshoot hardware and software issues across compute, network, and storage. Monitor fleet health, take remediation action, and push fixes upstream when needed. Write runbooks, and improve or delete outdated ones. This is a hands-on operations role with on-call responsibilities.

Required Qualifications

  • Administered Linux systems in the critical path
  • Troubleshot GPU node issues: NVLink, NCCL, IB, driver and firmware bugs
  • Experience with observability systems such as Grafana and Prometheus
  • Scripting: Bash, Python, Go, or equivalent