Fal · 1 month ago

Engineering Manager, Fleet Reliability

On-site · Remote, Oregon, United States or US

Type: Full Time
Level: Senior Level
Education: Not Specified
Company size: Enterprise

Job Summary

Fal is hiring an Engineering Manager to build and run Fleet Reliability for a GPU-heavy production environment. The role owns 24/7 coverage for node provisioning, validation, and triage, drives the automation roadmap, and defines the SLAs that keep production GPUs serving traffic. It also covers hiring, developing, and retaining the Fleet Reliability team, setting its operating model and culture, and leading work toward high-uptime, automated, observable infrastructure. The role emphasizes hands-on leadership, incident management, an automation-first mindset, and running a reliability function that scales with a growing fleet. Remote work is available.

Required Qualifications

  • 7+ years in infrastructure, software, or SRE, with 2+ years leading
  • Have run a fleet reliability or hardware ops team in production
  • Built SRE fundamentals into a team from scratch: incident management, postmortems, observability, change management
  • Pushed teams toward automation over toil
  • Carry the pager yourself before asking your team to