Member of Technical Staff, Pre-training Systems
$225,000–$550,000 per year
On-site · San Francisco, California, United States
Job Summary
As a Software Engineer on the Pre-training Systems team, you will design and operate distributed infrastructure for training long-context models at scale. Responsibilities include scaling distributed training across large GPU clusters, optimizing communication patterns, improving checkpointing and fault tolerance systems, and eliminating performance bottlenecks. The role requires a strong foundation in software engineering, experience with distributed systems, debugging skills in production ML systems, and a proven track record in performance optimization.
Required Qualifications
- Experience training large models in multi-node GPU environments
- Strong software engineering and distributed systems fundamentals
Desired Qualifications
- Deep understanding of parallelism strategies and performance trade-offs
- Experience debugging cross-layer issues in production ML systems
- Strong ownership mindset and ability to operate critical infrastructure
- Track record of improving performance or reliability of large-scale systems