Service Reliability Engineer
On-site · Sydney, New South Wales, Australia
Job Summary
The Site Reliability Engineer at Universal Music Group in Sydney leads the reliability, scalability, and performance of global services within a follow-the-sun framework aligned to Australian business hours. Focus areas include building robust monitoring and observability (CloudWatch, Dynatrace), automating deployments and scaling, maintaining CI/CD pipelines, managing incidents, performing root cause analysis, and embedding SRE best practices (SLOs, error budgets) across engineering teams. The role requires experience with Linux/Windows systems administration, programming (Python, Go, or Java), cloud platforms (AWS preferred), containers (Docker, Kubernetes), and IaC (Terraform, Ansible); familiarity with Prometheus, Grafana, Datadog, Splunk, and Dynatrace; strong problem-solving and communication skills; and willingness to work a Monday-to-Sunday roster with weekend office presence.
Required Qualifications
- Strong background in systems administration (Linux/Windows) in a large-scale environment
- Proficiency in at least one programming language (Python, Go, Java)
- Hands-on experience with a major cloud platform (AWS, GCP, or Azure; AWS preferred)
- Solid understanding of networking, containers (Docker, Kubernetes), and Infrastructure as Code (Terraform, Ansible)
- Experience with modern monitoring and observability tools (Prometheus, Grafana, Datadog, Splunk, Dynatrace)
- Proven analytical and problem-solving abilities in high-pressure environments
- Excellent communication skills and ability to foster a collaborative team environment
- Bachelor's degree in an IT-related field (preferred)