Senior Staff+ Software Engineer, Node Infra
$405,000–$485,000 per year
Hybrid · New York City, New York, United States or San Francisco, California, United States
Job Summary
- Own the technical strategy and roadmap for node lifecycle management: ingestion, bring-up, health checking, and automated repair
- Drive cross-team initiatives to scale AI clusters across multiple clouds and accelerator families
- Design and operate systems that detect, isolate, and remediate unhealthy hardware to maximize fleet reliability
- Define infrastructure architecture to solve hard problems, collaborating with cloud providers and internal teams to shape long-term compute, data, and infrastructure strategy
- Establish and evolve operational excellence practices (incident response, postmortem culture, on-call)
- Mentor and coach engineers
- Contribute deep expertise in distributed systems, cloud platforms, and machine learning accelerators
- Seek bold improvements in reliability, scalability, and efficiency across large-scale compute infrastructure
- Location-based hybrid policy requiring some in-office time at the NY or SF office
Required Qualifications
- Bachelor’s degree or equivalent
- 12+ years of software engineering experience (preferred)
- Experience with distributed systems and cloud platforms (Kubernetes, IaC, AWS/GCP/Azure)
- Proficiency in at least one systems language (Rust, Go, Python)
- IaC proficiency with Terraform
- Hands-on experience with GPUs/TPUs/Trainium
- Ability to lead multi-quarter technical initiatives across teams
- Strong cross-team collaboration and communication skills
- Visa sponsorship available