OpenAI logo
OpenAI1 month ago

Software Engineer, Hardware Health

$250,000–$445,000 year

On-site · San Francisco, California, United States

Type
Full Time
Level
Senior Level
Education
Not Specified
Company size
Large
Industry
AI Services

Job Summary

OpenAI is seeking a Software Engineer, Hardware Health to build critical infrastructure that keeps OpenAI’s largest compute clusters healthy and operational at scale. You’ll define and maintain health signals across GPUs, CPUs, networking, and platform infrastructure; build and evolve health checks to detect, remediate, and verify failures at scale; ensure health checks run with minimal latency to maximize uptime; investigate hardware failures across large-scale compute environments; manage node lifecycle workflows (drain, quarantine, repair, RMA, return-to-service); and develop automation and tooling for global cluster management, partnering with workload, reliability, and provider teams to integrate health signals into training and inference systems.

Required Qualifications

  • 7+ years of industry experience in software or infrastructure engineering
  • Strong proficiency with Python and shell scripting
  • Experience building large-scale distributed systems or infrastructure platforms
  • Comfort digging into noisy operational data using SQL, PromQL, or similar tooling
  • Experience building reproducible analyses and operational tooling
  • Strong systems debugging and operational instincts with an ownership mindset
Sorce

Apply with one swipe on Sorce. We auto-fill applications and apply on your behalf — no cover letters, no 40-minute forms.

Hiring someone like this?

Get your role in front of qualified candidates on Sorce.

Get started

$250k – $445k / yr

Software Engineer, Hardware Health · OpenAI

Apply on Sorce