Fal · 1 month ago

Operations Engineer, Fleet Reliability

Fal

On-site · Remote, Oregon, United States

Type: Full Time
Level: Mid Level
Education: Not Specified
Company size: Enterprise

Job Summary

Provision, validate, and triage GPU nodes across B300, H200, and H100 clusters. Troubleshoot hardware and software issues across compute, network, and storage. Monitor fleet health, take remediation action, and push fixes upstream when needed. Write runbooks, and improve or delete outdated ones. This is a hands-on operations role with on-call responsibilities.

Required Qualifications

Administered Linux systems in the critical path
Troubleshot GPU node issues: NVLink, NCCL, IB, driver and firmware bugs
Experience with observability systems such as Grafana and Prometheus
Scripting: Bash, Python, Go, or equivalent
