Lead Site Reliability Engineer
On-site · Columbus, Ohio, United States
Job Summary
Lead Site Reliability Engineer at JPMorgan Chase responsible for 24x7 production support and the reliability, scalability, and availability of mission-critical systems. Drive design and deployment approaches with automated CI/CD pipelines, implement infrastructure and network as code, collaborate with software engineers to improve deployment, monitoring, and incident response, and lead adoption of SRE best practices within the team. Provide on-call support and contribute to end-to-end operations, leveraging observability tools and large-scale telemetry to proactively resolve issues and optimize performance.
Required Qualifications
- Formal training or certification in software engineering concepts with 10+ years of applied experience.
- Proficiency in site reliability culture and principles, and familiarity with implementing site reliability within an application or platform.
- Proficiency in at least one programming or scripting language such as Python, Java/Spring Boot, or shell scripting.
- Experience in observability, including white-box and black-box monitoring, service level objective alerting, and telemetry collection, using tools such as Grafana, Dynatrace, Prometheus, Datadog, and Splunk.
- Experience with continuous integration and continuous delivery tools like Jenkins, Spinnaker, or Terraform, and with configuration management tools like SaltStack or Ansible.
- Experience managing, administering, and supporting large-scale, enterprise-level Splunk and ELK deployments that provide application monitoring and observability for a large number of applications.
- Experience managing, administering, and supporting vendor products such as Netcool, Grafana, and SCOM.
- Familiarity with containers and container orchestration technologies such as ECS, Kubernetes, and Docker.
- Experience troubleshooting performance issues and problems with common networking technologies.
- Ability to contribute to large and collaborative teams by presenting information in a logical and timely manner with compelling language and limited supervision.
- Ability to proactively recognize roadblocks, and a demonstrated interest in learning technology that facilitates innovation.
- Experience with large-scale, enterprise-level event streaming platforms like Kafka.
- Experience handling critical incident and change management, including participating in critical incident task force calls.
- Familiarity with agile practices, preferably Scrum and Kanban.
- Certifications (a plus)
- AWS Certified SysOps Administrator or Professional, Certified Kubernetes Administrator (CKA), Terraform Associate level, or equivalent.