Senior Site Reliability Engineer (Remote USA)
$149,100–$157,800 per year
Remote · Denver, Colorado, United States or US
Job Summary
This Senior Site Reliability Engineer role (Remote USA) owns platform reliability and AI operations for a high-scale semiconductor data platform. Responsibilities include owning SLOs, SLIs, and error budgets for production services; designing reliability patterns for AI agent pipelines; managing blast radius containment; maturing the active-active architecture toward a 24-hour RTO; leading incident response and post-incident reviews; and enabling software and AI engineering teams through CI/CD standards, IDP adoption, and self-service tooling. The role requires deep SRE expertise, leadership at the senior IC level, and strong experience with AWS, Terraform/GitOps, Datadog observability, Docker/Kubernetes, Python/Bash, and CI/CD workflows. Experience with agentic AI systems, AI workload observability, and IDP tooling is preferred. The position is remote within the US, with occasional travel.
Required Qualifications
- Bachelor's degree in Computer Science, Engineering, or equivalent
- 6–8 years of progressive experience in site reliability engineering, platform engineering, or DevOps
- Deep expertise in AWS (EKS, Lambda, CloudWatch) and multi-region architecture
- Proficiency with Terraform and GitOps
- Hands-on Datadog experience
- Strong containerization with Docker and Kubernetes (EKS preferred)
- Proficiency in Python and/or Bash; knowledge of Java and Spring Boot
- Experience with CI/CD pipelines (Bitbucket Pipelines, GitHub Actions)
- Familiarity with IDP tooling (Backstage, Atlassian Compass) preferred
- Experience with AI/ML workload infrastructure or agentic system operations considered a strong asset