JPMorgan Chase · 4 months ago

Distributed Training & Performance Engineer - Vice President

On-site · New York City, New York, United States

Type: Full Time
Level: Senior Level
Education: Master’s Degree
Company size: Enterprise
Industry: Investment Banking

Job Summary

This VP-level Distributed Training & Performance Engineer role is responsible for designing, optimizing, and scaling large-model pretraining workloads on hyperscale accelerator clusters. The role emphasizes end-to-end training performance management, development of high-performance kernels (CUDA/Triton), optimization of data, tensor, and pipeline parallelism, memory-efficient configurations, and cross-platform evaluation across GPUs and non-GPU accelerators, with a focus on improving training throughput, efficiency, and cost. It requires a Master’s degree with 3+ years of industry experience or a Ph.D. with 1+ years, along with a strong background in distributed training, profiling tools, GPU programming, Python and C++, and frameworks such as PyTorch or JAX.

Required Qualifications

  • Master’s degree with 3+ years of industry experience, or Ph.D. with 1+ years of industry experience, in computer science, physics, math, engineering, or a related field.
  • Engineering experience at top AI labs, HPC centers, chip vendors, or hyperscale ML infra teams.
  • Strong experience designing and operating large-scale distributed training jobs across multinode accelerator clusters.
  • Deep understanding of distributed parallelism strategies: data parallelism, tensor/model parallelism, pipeline parallelism, and memory/optimizer sharding.
  • Proven ability to profile and optimize training performance using industry-standard tools such as Nsight, PyTorch Profiler, or equivalent.
  • Hands-on experience with GPU programming and kernel optimization.
  • Strong understanding of accelerator memory hierarchies, bandwidth limitations, and compute-communication tradeoffs.
  • Experience with collective communication libraries and patterns (e.g., NCCL-style collectives).
  • Proficiency in Python for ML systems development and C++ for performance-critical components.
  • Experience with modern ML frameworks such as PyTorch or JAX in large-scale training settings.

Preferred Qualifications

  • Experience optimizing training workloads on non-GPU accelerators (e.g., TPUs or wafer-scale architectures).
  • Familiarity with compiler-driven ML systems (e.g., XLA, MLIR, Inductor) and graph-level optimizations.
  • Contributions to open-source ML systems, distributed training frameworks, or performance-critical kernels.
  • Prior experience collaborating directly with hardware vendors or accelerator teams.