Coupang Logistics Services · 5 days ago

Sr. Staff Observability Engineer (GPU Cloud & Telemetry Platform)

On-site · Seoul, Seoul, South Korea

Type: Full Time
Level: Senior Level
Education: Master's Degree
Company size: Unknown
Industry: E-commerce Services

Job Summary

Lead the design and evolution of a GPU-accelerated observability platform, owning end-to-end telemetry pipelines and dashboards for GPU clusters, datacenters, and distributed workloads. Architect scalable metric pipelines (Grafana Alloy → Mimir) and log pipelines (Datadog Vector → Loki), define multi-year observability roadmaps, establish GPU-specific SLIs/SLOs, and drive adoption of SRE principles across the team.

Build rich Grafana dashboards for real-time fleet health, capacity planning, and tenant-level usage and cost insights; enable self-service observability for platform and ML engineering teams; mentor engineers and participate in design reviews. Collaborate on integrations across the Grafana stack (Alloy, Mimir, Loki, Tempo), OpenTelemetry, and Terraform/Kubernetes CI/CD pipelines, with a focus on secure, multi-tenant telemetry pipelines and high-throughput, low-latency data paths.

Required capabilities include deep knowledge of GPU hardware (NVIDIA DCGM, MIG), Kubernetes GPU operators, high-performance networking (RDMA/InfiniBand), and Go/Python programming.

Required Qualifications

  • BS/MS in Computer Science or equivalent practical experience
  • Extensive experience in Observability, SRE, or Distributed Infrastructure
  • Proven track record building large-scale telemetry pipelines (metrics/logs)
  • Strong programming skills in Go or Python
  • Experience with GPU systems (NVIDIA DCGM, CUDA ecosystem) and high-performance networking (RDMA, InfiniBand)
  • Experience with Grafana Alloy / Prometheus ecosystem, Grafana Mimir, Grafana Loki, Datadog Vector or similar
  • Experience with Kubernetes, Linux internals, and cloud and hybrid environments
  • Ability to design, implement, and operate end-to-end telemetry pipelines (metrics/logs) and dashboards
  • Ability to define GPU-specific SLIs/SLOs and drive adoption of SRE practices
  • Experience with CI/CD integration and IaC tools (Terraform, Kubernetes)
  • Strong problem-solving and RCA capabilities
  • Mentorship and design-review leadership
  • Security considerations: encryption in transit and at rest, multi-tenant RBAC