JPMorgan Chase · 4 months ago

Distributed Training & Performance Engineer - Vice President

JPMorgan Chase

On-site · New York City, New York, United States

Type: Full Time
Level: Senior Level
Education: Master’s Degree
Company size: Enterprise
Industry: Investment Banking

Job Summary

This VP-level Distributed Training & Performance Engineer role is responsible for designing, optimizing, and scaling large-model pretraining workloads on hyperscale accelerator clusters. The role emphasizes end-to-end training performance, development of high-performance kernels (CUDA/Triton), optimization of data, tensor, and pipeline parallelism, memory-efficient configurations, and cross-platform evaluation across GPUs and non-GPU accelerators, with a focus on improving training throughput, efficiency, and cost. It requires a Master’s degree with 3+ years of industry experience or a Ph.D. with 1+ years, plus a strong background in distributed training, profiling tools, GPU programming, Python/C++, and frameworks such as PyTorch or JAX.

Required Qualifications

Master’s degree with 3+ years of industry experience, or Ph.D. with 1+ years of industry experience, in computer science, physics, math, engineering, or a related field.
Engineering experience at top AI labs, HPC centers, chip vendors, or hyperscale ML infra teams.
Strong experience designing and operating large-scale distributed training jobs across multinode accelerator clusters.
Deep understanding of distributed parallelism strategies: data parallelism, tensor/model parallelism, pipeline parallelism, and memory/optimizer sharding.
Proven ability to profile and optimize training performance using industry-standard tools such as Nsight, PyTorch Profiler, or equivalent.
Hands-on experience with GPU programming and kernel optimization.
Strong understanding of accelerator memory hierarchies, bandwidth limitations, and compute-communication tradeoffs.
Experience with collective communication libraries and patterns (e.g., NCCL-style collectives).
Proficiency in Python for ML systems development and C++ for performance-critical components.
Experience with modern ML frameworks such as PyTorch or JAX in large-scale training settings.

Preferred Qualifications

Experience optimizing training workloads on non-GPU accelerators (e.g., TPUs or wafer-scale architectures).
Familiarity with compiler-driven ML systems (e.g., XLA, MLIR, Inductor) and graph-level optimizations.
Contributions to open-source ML systems, distributed training frameworks, or performance-critical kernels.
Prior experience collaborating directly with hardware vendors or accelerator teams.
