Staff Software Engineer, Compute Architecture
$188,000–$275,000 per year
On-site · New York City, New York, United States or Bellevue, Washington, United States
Job Summary
The Staff Software Engineer, Compute Architecture, designs, builds, and operates Go-based services that manage the lifecycle of large-scale GPU data center infrastructure. In this role, you will:
- Contribute to automation for data center bring-up, hardware discovery, health monitoring, remediation, and production operations
- Develop reliable APIs and workflows for hardware management, firmware state, server health, and rack-level infrastructure
- Improve observability, incident response, and reliability
- Collaborate with hardware-adjacent, infrastructure, operations, and software teams
- Provide technical leadership through design and code reviews, architectural guidance, and mentorship
- Make pragmatic architecture decisions that balance reliability, simplicity, and scalability
The role emphasizes reliability, scalability, and driving platform improvements across fleets of GPU servers in a large-scale data center environment.
Required Qualifications
- B.S., M.S., or PhD in Computer Science or related field, or equivalent experience
- 8+ years of software engineering experience with a strong focus on infrastructure, cloud engineering, and distributed databases, particularly within large-scale data center and cloud environments
- Expertise in Go and proven experience building REST/gRPC APIs for mission-critical platforms
- Strong background in architecting and scaling cloud-native Kubernetes infrastructure and distributed services
- Proven success in mentoring engineers, leading technical projects, and influencing engineering strategy across teams
- Experience contributing to and collaborating with open source communities
- Skilled in applying a data-driven approach to reliability, optimization, and continuous improvement
- Excellent communicator able to work effectively with both technical and non-technical stakeholders
- Hands-on experience with observability stacks (Prometheus, Grafana, PromQL), CI/CD pipelines, and operating large fleets of GPU servers
- Track record of leading incident response, postmortems, and driving robust service reliability
Nice to Have
- Working knowledge of Kafka, ClickHouse, and CockroachDB (CRDB)
- Familiarity with DMTF Redfish APIs and GPU servers