Coupang Internal · 1 month ago

Senior Staff Cloud Backend Engineer - Observability and Site Reliability

Hybrid · Bengaluru, Karnataka, India

Type: Full Time
Level: Senior Level
Education: Bachelor’s Degree
Company size: Large
Industry: E-commerce

Job Summary

The Senior Staff Data Centre Observability and Site Reliability Engineer is responsible for designing, building, and operating scalable observability and reliability solutions for large-scale datacenter infrastructure. The role focuses on developing high-performance monitoring and telemetry platforms, ensuring system reliability, and driving operational excellence through automation, performance optimization, and SRE best practices. The engineer collaborates with cross-functional teams to enhance the visibility, resilience, and efficiency of critical systems.

Responsibilities include designing and implementing observability solutions (monitoring, logging, alerting, telemetry), building dashboards and reports, applying SRE principles, leading root cause analyses, optimizing performance, automating infrastructure provisioning, and ensuring security and compliance.

Proficiencies include Go or Python, Kubernetes internals, Prometheus, Grafana, ELK, and cloud platforms (AWS, Azure, GCP). The role follows a hybrid work model with at least 3 days in the office per week.

Required Qualifications

Bachelor’s or Master’s degree in Computer Science, Engineering, or a related technical field
12+ years of progressive software engineering experience, with a heavy emphasis on distributed systems, cloud-native architectures, or platform operations
Proven experience in managing and optimizing large-scale datacenter environments
Strong proficiency in Go or Python, with a deep understanding of networked systems and performance optimization
Expert-level knowledge of Kubernetes internals (scheduling, controllers) and containerization ecosystems
Proven experience with load balancing, service mesh, and request routing at scale
Proficiency in observability tools and technologies (e.g., Prometheus, Grafana, ELK Stack)
Experience with SRE practices and tools (e.g., Kubernetes, Docker, Terraform)
Familiarity with cloud platforms (AWS, Azure, GCP) and their observability and reliability services
