Experian · 4 days ago

Site Reliability Engineering Lead

Experian

On-site · Hyderabad, Telangana, India

Type: Full Time

Level: Senior Level

Education: Bachelor's Degree

Company size: Enterprise

Job Summary

Lead Site Reliability Engineering at Experian to define and implement SRE best practices across the organization. Mentor and guide an SRE team, align reliability goals with business objectives, and drive initiatives to improve system resilience and reduce toil.

Own incident management from escalation through root cause analysis and postmortems, and champion observability through monitoring, logging, and alerting. Architect and operate self-healing, resilient systems using tools like Prometheus, Grafana, ELK, and AWS CloudWatch, including exposure to chaos engineering concepts (Gremlin, Chaos Monkey, AWS FIS) for outage simulation. Collaborate with development teams to integrate reliability into design and deployment, communicate with technical and non-technical stakeholders, and lead efforts to scale reliability in a high-volume, regulated environment.

Requires a degree in CS/engineering, 12+ years in software development, and 5+ years leading an SRE team, with deep AWS expertise and strong leadership capable of motivating distributed teams.

Required Qualifications

Degree in Computer Science or a related field: B.Sc. in Computer Science, MCA in Computer Science, Bachelor of Technology in Engineering, or higher
Minimum 12 years of software development experience
At least 5 years of experience leading an SRE team
Strong leadership capabilities and experience managing geographically distributed teams
Deep expertise with AWS services and monitoring/observability tools
Proven ability to define and drive reliability practices (SLIs/SLOs/SLAs) and incident management
Experience building secure, mission-critical, high-volume transaction systems (preferably in regulated industries such as finance/insurance)
Excellent communication skills with both technical and non-technical stakeholders
Hands-on technologist with a strong technical foundation, able to work as an individual contributor as well as a team leader
Experience with automation, runbooks, disaster recovery, and self-healing systems
Familiarity with observability tooling (Prometheus, Grafana, ELK, AWS CloudWatch) and incident response processes
Knowledge of AIOps or ML-based anomaly detection is a plus
Ability to work with cross-functional teams to embed reliability into software design and deployment
