Senior Compute Platform Engineer
Remote · Pittsburgh, Pennsylvania, United States
Job Summary
The Senior Compute Platform Engineer is accountable for designing and operating the high-scale batch compute systems and workflow orchestration that power the work of engineers across the company. Responsibilities include:
- Designing and operating distributed systems that schedule and execute large-scale batch workloads across Kubernetes clusters
- Building and maintaining compute platform abstractions
- Optimizing compute resource utilization
- Developing and improving multi-tenant scheduling strategies
- Enhancing the reliability and fault tolerance of large-scale distributed jobs and platform components
- Collaborating across teams to understand workload requirements and improve platform capabilities
- Contributing to platform tooling, automation, and CI/CD workflows
Required Qualifications
- 7+ years of experience building and operating distributed systems or infrastructure platforms
- Strong experience with Kubernetes and container orchestration in production-grade environments
- Proficiency developing in Go (Golang) and Python
- Experience designing and operating large-scale batch compute systems
- Strong debugging and problem-solving skills in complex distributed systems
- Ability to collaborate across teams and communicate technical concepts clearly
- Experience with at least one batch scheduling system such as Kueue, Armada, Volcano, or Slurm