Engineering Manager, Fleet Reliability
On-site · Remote, Oregon, United States or US
Job Summary
Seeking an Engineering Manager to build and run Fleet Reliability for a GPU-heavy production environment. You will own 24/7 coverage for node provisioning, validation, and triage; drive the automation roadmap; and define the SLAs that keep production GPUs serving traffic. You will hire, develop, and retain the Fleet Reliability team, set its operating model and culture, and lead the push toward high-uptime, automated, observable infrastructure. The role emphasizes hands-on leadership, incident management, an automation-first mindset, and operating a reliability function that scales with a growing fleet. Remote work is available.
Required Qualifications
- 7+ years in infrastructure, software, or SRE, including 2+ years leading teams
- Ran a fleet reliability or hardware ops team in production
- Built SRE fundamentals into a team from scratch: incident management, postmortems, observability, change management
- Pushed teams toward automation over toil
- Carry the pager yourself before asking your team to