Magic Dev · 3 months ago

Member of Technical Staff, Pre-training Systems

$225,000–$550,000 per year

On-site · San Francisco, California, United States

Type: Full Time
Level: Mid Level
Education: Not Specified
Company size: Unknown
Industry: Artificial Intelligence

Job Summary

As a Software Engineer on the Pre-training Systems team, you will design and operate distributed infrastructure for training long-context models at scale. Responsibilities include scaling distributed training across large GPU clusters, optimizing communication patterns, improving checkpointing and fault tolerance systems, and eliminating performance bottlenecks. The role requires a strong foundation in software engineering, experience with distributed systems, debugging skills in production ML systems, and a proven track record in performance optimization.

Required Qualifications

  • Experience training large models in multi-node GPU environments
  • Strong software engineering and distributed systems fundamentals

Desired Qualifications

  • Deep understanding of parallelism strategies and performance trade-offs
  • Experience debugging cross-layer issues in production ML systems
  • Strong ownership mindset and ability to operate critical infrastructure
  • Track record of improving performance or reliability of large-scale systems