Software Engineer, Machine Learning Platform
$187,000–$259,000 per year
Hybrid · San Francisco, California, United States or US
Job Summary
The role focuses on reliability, governance, and cost efficiency. Responsibilities:
- Design, build, and operate scalable ML infrastructure on AWS
- Develop distributed training and batch processing systems using Ray
- Implement infrastructure as code with Terraform
- Support and evolve feature store pipelines
- Develop data ingestion and streaming systems (Kinesis, Kafka, Flink, Spark)
- Improve CI/CD for ML models and platform components
- Enhance observability and cost visibility across ML workloads
- Collaborate with Data Science and ML Engineering teams
- Participate in on-call rotations
- Contribute to platform architecture and technical roadmaps
Required Qualifications
- 5+ years of experience in ML infrastructure, platform engineering, or production ML systems
- Knowledge of the machine learning model development lifecycle (data preprocessing, model training, evaluation, deployment)
- Experience with distributed systems, cloud computing, or large-scale data processing
- Strong foundation in computer science and software engineering principles
- Hands-on experience with CI/CD pipelines, DevOps practices, and infrastructure as code
- Experience with containerization (Docker) and Kubernetes
- Knowledge of AWS and distributed computing frameworks (Spark, Ray)
- Experience with GPU programming (CUDA) and GPU cost optimization
- Strong programming skills in Python, Go, Scala, Java, or similar languages
- Familiarity with Terraform or CloudFormation
- Solid understanding of software engineering fundamentals (testing, version control, observability)
- Nice-to-have: experience with Ray, feature stores, real-time ML systems