Site Reliability Engineer
$180,000–$250,000 per year
Remote · Sydney, New South Wales, Australia or AU
Job Summary
- Own reliability, availability, and performance of production systems running in cloud environments
- Define and monitor SLIs/SLOs and help manage error budgets
- Lead incident response, including detection, triage, mitigation, and postmortems
- Improve observability through logging, monitoring, alerting, and dashboards
- Automate operational workflows and reduce manual toil
- Partner closely with engineering teams to improve system resiliency and scalability
- Assist with capacity planning, infrastructure optimization, and performance tuning
- Build internal tooling, runbooks, and operational best practices
- Support Kubernetes-based infrastructure and distributed systems at scale
- Act as an escalation point for complex production and platform issues
Required Qualifications
- 5+ years of experience in Site Reliability Engineering, DevOps, Platform Engineering, or related infrastructure roles
- Strong experience with cloud platforms such as AWS, GCP, or Azure
- Hands-on experience with Kubernetes and containerized environments
- Strong understanding of distributed systems and microservices architecture
- Experience with observability tools such as Prometheus, Grafana, Datadog, ELK, or OpenTelemetry
- Proficiency with infrastructure automation and scripting (Terraform, Python, Bash, etc.)
- Experience managing CI/CD pipelines and deployment automation
- Strong troubleshooting and incident management skills
- Ability to work cross-functionally and communicate effectively in high-pressure situations