Operations Engineer, Fleet Reliability
On-site · Remote, Oregon, United States
Type: Full Time
Level: Mid Level
Education: Not Specified
Company size: Enterprise
Job Summary
Provision, validate, and triage GPU nodes across B300, H200, and H100 clusters. Troubleshoot hardware and software issues across compute, network, and storage. Monitor fleet health, take remediation action, and push fixes upstream when needed. Write runbooks, and improve or delete outdated ones. This is a hands-on operations role with on-call responsibilities.
Required Qualifications
- Administered Linux systems in the critical path
- Troubleshot GPU node issues: NVLink, NCCL, InfiniBand (IB), driver and firmware bugs
- Experience with observability systems such as Grafana and Prometheus
- Scripting: Bash, Python, Go, or equivalent