NVIDIA · 1 week ago

Senior Solutions Architect, AI Cluster Performance and Telemetry

$184,000–$356,500 per year

On-site · Austin, Texas, United States or Santa Clara, California, United States

Type: Full Time
Level: Senior Level
Education: Master's Degree
Company size: Enterprise

Job Summary

The Senior Solutions Architect optimizes AI cluster performance across GPUs, CPUs, and networking, develops benchmarking suites that stress-test high-performance clusters, and uses telemetry tools (Perf, eBPF, Prometheus, Grafana) to monitor and improve infrastructure. The role involves analyzing and resolving complex performance bottlenecks in AI, deep learning, and HPC ecosystems, translating telemetry into dashboards, and collaborating with internal engineering teams, partners, and customers to deliver scalable solutions. It requires 8+ years of experience in system build and performance analysis, with expertise in multi-GPU communication (NCCL), NVIDIA hardware architectures, and container and orchestration tooling (Docker, Kubernetes, SLURM, Ansible); experience with NVLink/NVSwitch and NVIDIA Nsight tools is a plus. Base salary is determined by location and experience; equity and benefits are provided.

Required Qualifications

  • BS or MS in Engineering, Electrical Engineering, Physics, or Computer Science (or equivalent experience)
  • 8+ years of work-related experience in high-tech system build, performance analysis, and technical customer-facing roles
  • Strong understanding of CPUs, GPUs, and high-speed networking within large clusters
  • Experience with performance counters, profiling tools, and telemetry collection systems (e.g., Perf, eBPF, Prometheus, Grafana)
  • Experience with containers and orchestration or automation tools such as Docker, Docker Swarm, Kubernetes, SLURM, and Ansible
  • Proven ability to transform raw logs/telemetry into structured time series data, dashboards, and heat maps
  • Ability to translate complex technical performance issues into clear, actionable narratives for cross-functional teams
  • Strong collaboration skills across diverse engineering and operations teams