Software Engineer, Infrastructure Reliability
$255,000–$405,000 per year
On-site · San Francisco, California, United States
Job Summary
Join OpenAI's Applied AI Infrastructure team as a Software Engineer focusing on Infrastructure Reliability. You will design, build, and maintain reliable systems that enhance safety and performance. Responsibilities include identifying and fixing performance bottlenecks, improving automation, and collaborating with cross-functional teams to ensure system resilience. Candidates should have extensive knowledge of distributed systems, experience with Kubernetes and cloud infrastructure, and a passion for optimizing performance at scale.
Required Qualifications
- 4+ years of relevant industry experience
- 2+ years leading large-scale, complex projects or teams as an engineer or tech lead
- Proven experience as a reliability engineer or production engineer
- Strong proficiency in programming or scripting languages
- Experience with containerization technologies
Desired Qualifications
- Experience operating orchestration systems such as Kubernetes at scale
- Strong proficiency in cloud infrastructure (such as AWS, GCP, or Azure)
- Experience with observability tools such as Datadog, Prometheus, Grafana, Splunk, and the ELK stack
- Experience with microservices architecture and service mesh technologies
Additional Requirements
- Background checks for applicants will be administered in accordance with applicable law