Site Reliability Engineering Lead
On-site · Hyderabad, Telangana, India
Job Summary
Lead Site Reliability Engineering at Experian, defining and implementing SRE best practices across the organization. Mentor and guide an SRE team, align reliability goals with business objectives, and drive initiatives that improve system resilience and reduce toil. Own incident management from escalation through root cause analysis and postmortems, and champion observability through monitoring, logging, and alerting. Architect and operate self-healing, resilient systems using tools such as Prometheus, Grafana, ELK, and AWS CloudWatch, with exposure to chaos engineering tools (Gremlin, Chaos Monkey, AWS FIS) for outage simulation. Collaborate with development teams to integrate reliability into design and deployment, communicate with technical and non-technical stakeholders, and lead efforts to scale reliability in a high-volume, regulated environment. The role requires a degree in computer science or engineering, 12+ years in software development, 5+ years leading an SRE team, deep AWS expertise, and strong leadership skills for motivating distributed teams.
Required Qualifications
- Degree in Computer Science or a related field (B.Sc. or MCA in Computer Science, Bachelor of Technology in Engineering, or higher)
- Minimum 12 years of software development experience
- At least 5 years of experience leading an SRE team
- Strong leadership capabilities and experience managing geographically distributed teams
- Deep expertise with AWS services and monitoring/observability tools
- Proven ability to define and drive reliability practices (SLIs/SLOs/SLAs) and incident management
- Experience building secure, mission-critical, high-volume transaction systems (preferably in regulated industries such as finance/insurance)
- Excellent communication skills with both technical and non-technical stakeholders
- Hands-on technologist with a strong technical foundation, able to work as an individual contributor as well as a team leader
- Experience with automation, runbooks, disaster recovery, and self-healing systems
- Familiarity with observability tooling (Prometheus, Grafana, ELK, AWS CloudWatch) and incident response processes
- Knowledge of AIOps or ML-based anomaly detection is a plus
- Ability to work with cross-functional teams to embed reliability into software design and deployment