Production Engineer, Compute
$175,000–$300,000 year
Remote · New York City, New York, United States or Seattle, Washington, United States
Job Summary
Production Engineer, Compute role focused on end-to-end compute fleet health and automation across GPU/TPU infrastructure. Own metrics pipelines, alerting, and a unified health view for Kubernetes‐orchestrated workloads and bare metal compute. Design and implement an automation-driven repair pipeline from detection through triage, parts management, and return to service; build the XPU qualification platform with burn-in, performance baselining, and NPI execution; own Redfish and BMC tooling, firmware telemetry, and fleet‐scale log collection. Drive end-to-end reliability, scalability, and operational discipline for one of the world’s largest XPU fleets, leveraging incident response, tooling, and AI-assisted development. Requirements emphasize toil as a bug mindset, hardware intuition at firmware/silicon levels, capability to operate under ambiguity, rapid learning, and proficiency with AI tooling and production automation (Go, Python). Bonus areas include hardware lifecycle/RMA automation, workflow engines, and metrics/alerting pipelines (Prometheus, Grafana).
Apply with one swipe on Sorce. We auto-fill applications and apply on your behalf — no cover letters, no 40-minute forms.
Hiring someone like this?
Get your role in front of qualified candidates on Sorce.