Sr. Staff Observability Engineer (GPU Cloud & Telemetry Platform)
On-site · Seoul, Seoul, South Korea
Job Summary
Lead the design and evolution of a GPU-accelerated observability platform, owning end-to-end telemetry pipelines and dashboards for GPU clusters, datacenters, and distributed workloads. Architect scalable metric pipelines (Grafana Alloy → Mimir) and log pipelines (Datadog Vector → Loki), define multi-year observability roadmaps, establish GPU-specific SLIs/SLOs, and drive adoption of SRE principles across the team. Build Grafana dashboards for real-time fleet health, capacity planning, and tenant-level usage and cost insights; enable self-service observability for platform and ML engineering teams; mentor engineers; and participate in design reviews. Collaborate on integrations across the Grafana stack (Alloy, Mimir, Loki, Tempo), OpenTelemetry, and Terraform- and Kubernetes-based CI/CD pipelines, with a focus on secure, multi-tenant telemetry and high-throughput, low-latency data paths. The role requires deep knowledge of NVIDIA GPU hardware and tooling (DCGM, MIG), Kubernetes GPU operators, and high-performance networking (RDMA/InfiniBand), along with Go or Python programming skills.
Required Qualifications
- BS/MS in Computer Science or equivalent practical experience
- Extensive experience in observability, SRE, or distributed infrastructure
- Proven track record building large-scale telemetry pipelines (metrics/logs)
- Strong programming skills in Go or Python
- Experience with GPU systems (NVIDIA DCGM, CUDA ecosystem) and high-performance networking (RDMA, InfiniBand)
- Experience with Grafana Alloy / Prometheus ecosystem, Grafana Mimir, Grafana Loki, Datadog Vector or similar
- Experience with Kubernetes, Linux internals, and cloud and hybrid environments
- Ability to design, implement, and operate end-to-end telemetry pipelines (metrics/logs) and dashboards
- Ability to define GPU-specific SLIs/SLOs and drive adoption of SRE practices
- Experience integrating CI/CD with IaC and orchestration tooling (Terraform, Kubernetes)
- Strong problem-solving and root-cause analysis (RCA) skills
- Experience mentoring engineers and leading design reviews
- Familiarity with telemetry security: encryption in transit and at rest, multi-tenant RBAC