Service Reliability Engineer
On-site · London, England, United Kingdom
Job Summary
The Site Reliability Engineer is responsible for the reliability, scalability, and performance of critical global systems. You will design, build, and maintain monitoring, alerting, and observability tooling; automate infrastructure provisioning, deployments, and scaling; and manage on-call incidents and drive post-incident reviews. You will partner with engineering, IT, and security to embed SRE practices (SLOs, error budgets) into the service lifecycle, ensuring that the services connecting artists and fans remain available, scalable, and efficient.
Required Qualifications
- Strong background in systems administration (Linux/Windows) in a large-scale environment
- Proficiency in at least one programming language (e.g., Python, Go, Java)
- Hands-on experience with a major cloud platform (AWS, GCP, or Azure)
- Solid understanding of networking, containers (Docker, Kubernetes), and Infrastructure as Code (e.g., Terraform, Ansible)
- Experience with modern monitoring and observability tools (e.g., Prometheus, Grafana, Datadog, Splunk, Dynatrace)
- Proven analytical and problem-solving abilities with experience in a high-pressure environment
- Excellent communication skills and the ability to foster a collaborative team environment