Distributed Training & Performance Engineer - Vice President
On-site · New York City, New York, United States
Job Summary
The Distributed Training & Performance Engineer (Vice President) designs, optimizes, and scales large-model pretraining workloads on hyperscale accelerator clusters. The role covers end-to-end training performance: developing high-performance kernels (CUDA/Triton), optimizing data, tensor, and pipeline parallelism, choosing memory-efficient configurations, and evaluating workloads across GPUs and non-GPU accelerators, with the goal of improving training throughput, efficiency, and cost. Candidates need a Master's degree with 3+ years or a Ph.D. with 1+ years of industry experience, plus a strong background in distributed training, profiling tools, GPU programming, Python and C++, and frameworks such as PyTorch or JAX.
Required Qualifications
- Master’s degree with 3+ years of industry experience, or Ph.D. with 1+ years of industry experience, in computer science, physics, math, engineering, or a related field.
- Engineering experience at top AI labs, HPC centers, chip vendors, or hyperscale ML infra teams.
- Strong experience designing and operating large-scale distributed training jobs across multinode accelerator clusters.
- Deep understanding of distributed parallelism strategies: data parallelism, tensor/model parallelism, pipeline parallelism, and memory/optimizer sharding.
- Proven ability to profile and optimize training performance using industry-standard tools such as Nsight, PyTorch Profiler, or equivalent.
- Hands-on experience with GPU programming and kernel optimization.
- Strong understanding of accelerator memory hierarchies, bandwidth limitations, and compute-communication tradeoffs.
- Experience with collective communication libraries and patterns (e.g., NCCL-style collectives).
- Proficiency in Python for ML systems development and C++ for performance-critical components.
- Experience with modern ML frameworks such as PyTorch or JAX in large-scale training settings.
Preferred Qualifications
- Experience optimizing training workloads on non-GPU accelerators (e.g., TPUs or wafer-scale architectures).
- Familiarity with compiler-driven ML systems (e.g., XLA, MLIR, Inductor) and graph-level optimizations.
- Contributions to open-source ML systems, distributed training frameworks, or performance-critical kernels.
- Prior experience collaborating directly with hardware vendors or accelerator teams.