Site Reliability Engineer
$100,000–$170,000 per year
Remote · United States
Job Summary
Build and improve automation, tooling, and infrastructure for AI workloads. Define and maintain basic SLOs/SLIs and monitoring dashboards. Participate in incident response and post-incident reviews. Collaborate with Engineering, Networking, and Infrastructure teams to improve system stability. Learn from senior engineers and grow in reliability engineering, with exposure to cloud, Kubernetes, HPC, and AI/GPU workloads. Compensation includes a competitive base salary plus equity and bonus programs, with a flexible work environment.
Required Qualifications
- 2–5 years of experience in Site Reliability Engineering, Systems Engineering, or Software Engineering in a data center environment
- 2+ years of programming experience (e.g., Python, Go, or similar), with an interest in automation and tooling
- Working knowledge of Linux systems, networking concepts, and distributed systems
- Experience troubleshooting system or application issues in production environments
- Familiarity with monitoring or observability tools (e.g., logs, metrics, dashboards)
- Strong willingness to learn and improve reliability and operational practices
- Ability to work in fast-paced environments and collaborate across teams