JPMorgan Chase · 2 months ago

Lead Site Reliability Engineer

JPMorgan Chase

On-site · Columbus, Ohio, United States

Type

Full Time

Level

Senior Level

Education

License Or Certification

Company size

Enterprise

Industry

Investment Banking

Job Summary

Lead Site Reliability Engineer at JPMorgan Chase responsible for 24x7 production support and the reliability, scalability, and availability of mission-critical systems. Drive design and deployment approaches with automated CI/CD pipelines, implement infrastructure and network as code, collaborate with software engineers to improve deployment, monitoring, and incident response, and lead adoption of SRE best practices within the team. Provide on-call support and contribute to end-to-end operations, leveraging observability tools and large-scale telemetry to proactively resolve issues and optimize performance.

Required Qualifications

Formal training or certification in software engineering concepts with 10+ years of applied experience.
Proficiency in site reliability culture and principles, and familiarity with how to implement site reliability within an application or platform
Proficiency in at least one programming language, such as Python, Java/Spring Boot, or shell scripting.
Experience in observability, such as white-box and black-box monitoring, service level objective (SLO) alerting, and telemetry collection using tools such as Grafana, Dynatrace, Prometheus, Datadog, Splunk, and others
Experience with continuous integration and continuous delivery tools such as Jenkins, Spinnaker, or Terraform, and with configuration management tools such as SaltStack or Ansible
Experience managing, administering, and supporting large-scale, enterprise-level Splunk and ELK deployments that provide application monitoring and observability for a large number of applications
Experience managing, administering, and supporting vendor products such as Netcool, Grafana, and SCOM
Familiarity with containers and container orchestration, such as ECS, Kubernetes, and Docker
Experience with troubleshooting performance issues, and with common networking technologies and issues
Ability to contribute to large and collaborative teams by presenting information in a logical and timely manner with compelling language and limited supervision
Ability to proactively recognize roadblocks, and a demonstrated interest in learning technology that facilitates innovation
Experience with large-scale, enterprise-level event streaming platforms such as Kafka
Experience handling critical incident and change management, including taking part in critical incident task force calls.
Familiarity with agile practices, preferably Scrum and Kanban

Certifications (a plus)

AWS Certified SysOps Administrator or Professional, Certified Kubernetes Administrator (CKA), Terraform Associate level or equivalent
