Site Reliability Engineer (SRE)
On-site · Bengaluru, Karnataka, India
Job Summary
- Design and scale observability frameworks (metrics, logs, traces, event streams) across cloud environments.
- Define and manage SLIs/SLOs to ensure high availability, performance, and reliability.
- Build proactive, AI-driven monitoring systems to detect anomalies and predict failures.
- Develop automation and self-healing capabilities to reduce manual intervention and improve system resilience.
- Enable event-driven operations, integrating with tools like ServiceNow, PagerDuty, and Slack.
- Collaborate with engineering, SecOps, and FinOps teams to improve reliability, security, and cost efficiency.

You have 8+ years of SRE/Cloud/Platform Engineering experience and are proficient with Prometheus, Grafana, Datadog, OpenTelemetry, and CloudWatch. You code in Python, Go, or Bash and deploy with Docker and Kubernetes in cloud-native environments.

This is an on-site role in Bengaluru, India, with at least 3 days per week in the office.
Required Qualifications
- 8+ years in SRE, Cloud, or Platform Engineering, including experience with AWS production environments
- Expertise in Prometheus, Grafana, Datadog, OpenTelemetry, and CloudWatch, and in managing SLIs/SLOs
- Strong skills in Python, Go, or Bash
- Experience with distributed systems, microservices, Docker, and Kubernetes
- Knowledge of event-driven operations and incident tools (ServiceNow, PagerDuty, Slack)
- Experience collaborating across functions, and a drive to improve reliability, security, and cost efficiency