JPMorgan Chase · 3 weeks ago

Senior Lead Site Reliability Engineer

JPMorgan Chase

On-site · Glasgow, Scotland, United Kingdom

Type: Full Time
Level: Senior Level
Education: Not Specified
Company size: Enterprise
Industry: Financial Services

Job Summary

The Senior Lead Site Reliability Engineer guides reliability, observability, and performance across large-scale platforms. The role involves developing production code for reliability tooling and telemetry pipelines; leading incidents and driving blameless postmortems; defining and implementing SLOs/SLIs; designing, deploying, and maintaining OpenTelemetry-based telemetry ingestion and processing in hybrid on-prem/cloud environments with backends such as InfluxDB, Prometheus, Elasticsearch, and OpenSearch; migrating legacy telemetry to standardized instrumentation; and mentoring engineers and influencing broader engineering practices to advance observability and reliability technologies. The role requires strong programming, cloud-native, container orchestration, and incident-response skills.

Required Qualifications

Formal training or certification on software engineering concepts
Advanced knowledge of reliability, scalability, performance, security, enterprise system architecture, toil reduction, and other site reliability best practices
Advanced proficiency in one or more programming languages (e.g., Java, Python, Go)
Advanced proficiency with observability tools (Grafana, Dynatrace, Prometheus, Datadog, Splunk, Elasticsearch, OpenSearch)
Proficiency with CI/CD and infrastructure-as-code tools (Jenkins, GitLab, Terraform)
Experience with containers and container orchestration (ECS, Kubernetes, Docker)
Hands-on experience with OpenTelemetry collectors in production
Ability to tackle reliability design independently
Practical cloud native experience
Ability to collaborate across stakeholder groups
Knowledge of distributed tracing, metrics, and logging best practices
Certification in AWS, Kubernetes, or other relevant technologies
Track record in system health monitoring, capacity management, and blameless postmortems
Understanding of distributed system design principles, networking, and Linux internals
Contributions to open-source observability or telemetry projects
Experience with agent control planes and management protocols such as OpAMP