NVIDIA · 1 week ago

Senior Systems Software Engineer, AI Stack and Performance - DGX Station

$224,000–$356,500 / year

Remote · United States or Santa Clara, California, United States

Type: Full Time
Level: Senior Level
Education: Master's Degree
Company size: Enterprise

Job Summary

Own production readiness of AI applications on DGX Station: NemoClaw, Hermes agents, NIM microservices, and customer workloads. Define ready-to-ship criteria, run validation, and close gaps between 'it runs' and 'it runs well' across single- and multi-GPU configurations.

Profile and optimize DL workloads (PyTorch, TensorFlow, JAX) for GB300 Blackwell GPUs and validate multi-user scenarios. Collaborate with framework, compiler, and GPU teams to improve kernel fusion, graph execution, and memory management. Ensure DGX Station delivers high throughput and reliable performance for local LLM training and inference across diverse workloads.

Ensure full NVIDIA AI software stack compatibility (CUDA toolkit, cuDNN, TensorRT, NCCL, Triton Inference Server, DCGM, DOCA/OFED) and maintain benchmarking and regression pipelines. Communicate target use cases to OEM/OSV partners and support customer deployment readiness.

Required Qualifications

  • BS or MS or equivalent experience in Computer Science, Electrical Engineering, or related field
  • 12+ years in systems software engineering with hands-on experience in AI/ML workload optimization, GPU performance analysis, or deep learning infrastructure
  • Strong proficiency with deep learning frameworks—PyTorch, TensorFlow, or JAX—including internals: graph execution, operator dispatch, memory management, and custom kernel integration
  • Experience profiling and optimizing GPU workloads using Nsight Systems, Nsight Compute, CUPTI, or equivalent
  • Ability to read GPU traces and translate observations into actionable optimizations
  • Strong understanding of GPU architecture: compute units, memory hierarchy, NVLink, multi-GPU scaling, and how they impact AI workload performance
  • Experience with inference optimization: quantization (INT8/FP8), model compilation (TensorRT, torch.compile), batching strategies, and serving frameworks
  • Proficiency in C/C++, CUDA, and Python
  • Comfortable reading and modifying GPU kernels
  • Experience shipping AI-powered products where application performance on specific hardware was a hard shipping requirement
