NVIDIA · 1 week ago

Senior Systems Software Engineer, AI Stack and Performance - DGX Station

$224,000–$356,500 per year

Remote · United States or Santa Clara, California, United States

Type: Full Time

Level: Senior Level

Education: Master's Degree

Company size: Enterprise

Job Summary

Own production readiness of AI applications on DGX Station, including NemoClaw, Hermes agents, NIM microservices, and customer workloads. Define ready-to-ship criteria, run validation, and close gaps between 'it runs' and 'it runs well' across single- and multi-GPU configurations.

Profile and optimize DL workloads (PyTorch, TensorFlow, JAX) for GB300 Blackwell GPUs and validate multi-user scenarios. Collaborate with framework, compiler, and GPU teams to improve kernel fusion, graph execution, and memory management. Ensure DGX Station delivers high throughput and reliable performance for local LLM training and inference across diverse workloads.

Ensure full NVIDIA AI software stack compatibility (CUDA toolkit, cuDNN, TensorRT, NCCL, Triton Inference Server, DCGM, DOCA/OFED) and maintain benchmarking and regression pipelines. Communicate target use cases with OEM/OSV partners and support customer deployment readiness.

Required Qualifications

BS or MS or equivalent experience in Computer Science, Electrical Engineering, or related field
12+ years in systems software engineering with hands-on experience in AI/ML workload optimization, GPU performance analysis, or deep learning infrastructure
Strong proficiency with deep learning frameworks—PyTorch, TensorFlow, or JAX—including internals: graph execution, operator dispatch, memory management, and custom kernel integration
Experience profiling and optimizing GPU workloads using Nsight Systems, Nsight Compute, CUPTI, or equivalent
Ability to read GPU traces and translate observations into actionable optimizations
Strong understanding of GPU architecture: compute units, memory hierarchy, NVLink, multi-GPU scaling, and how they impact AI workload performance
Experience with inference optimization: quantization (INT8/FP8), model compilation (TensorRT, torch.compile), batching strategies, and serving frameworks
Proficiency in C/C++, CUDA, and Python
Comfortable reading and modifying GPU kernels
Experience shipping AI-powered products where application performance on specific hardware was a hard shipping requirement
