Coupang Logistics Services · Posted 5 days ago

Sr. Staff Observability Engineer (GPU Cloud & Telemetry Platform)

Coupang Logistics Services

On-site · Seoul, Seoul, South Korea

Type: Full Time
Level: Senior Level
Education: Masters Degree
Company size: Unknown
Industry: E-commerce Services

Job Summary

Lead the design and evolution of a GPU-accelerated observability platform, owning end-to-end telemetry pipelines and dashboards for GPU clusters, datacenters, and distributed workloads. Architect scalable metric pipelines (Grafana Alloy → Mimir) and log pipelines (Datadog Vector → Loki), define multi-year observability roadmaps, establish GPU-specific SLIs/SLOs, and drive adoption of SRE principles across the team. Build rich Grafana dashboards for real-time fleet health, capacity planning, and tenant-level usage and cost insights; enable self-service observability for platform and ML engineering teams; mentor engineers and participate in design reviews. Collaborate on integrations across the Grafana stack (Alloy, Mimir, Loki, Tempo), OpenTelemetry, and Terraform/Kubernetes CI/CD pipelines, with a focus on secure, multi-tenant telemetry pipelines and high-throughput, low-latency data paths. Required capabilities include deep knowledge of GPU hardware (NVIDIA DCGM, MIG), Kubernetes GPU operators, and high-performance networking (RDMA/InfiniBand), plus programming in Go or Python.

Required Qualifications

BS/MS in Computer Science or equivalent practical experience
Extensive experience in Observability, SRE, or Distributed Infrastructure
Proven track record building large-scale telemetry pipelines (metrics/logs)
Strong programming skills in Go or Python
Experience with GPU systems (NVIDIA DCGM, CUDA ecosystem) and high-performance networking (RDMA, InfiniBand)
Experience with Grafana Alloy / Prometheus ecosystem, Grafana Mimir, Grafana Loki, Datadog Vector or similar
Experience with Kubernetes, Linux internals, and cloud and hybrid environments
Ability to design, implement, and operate end-to-end telemetry pipelines (metrics/logs) and dashboards
Ability to define GPU-specific SLIs/SLOs and drive adoption of SRE practices
Experience with CI/CD integration and IaC tools (Terraform, Kubernetes)
Strong problem-solving and RCA capabilities
Mentorship and design-review leadership
Familiarity with security considerations: encryption in transit and at rest, multi-tenant RBAC
