Site Reliability Engineer - AI & ML Infrastructure (Kubernetes, AWS & Terraform)
Remote · New York City, New York, United States or US
Job Summary
As a Site Reliability Engineer, you will architect, build, and maintain the hybrid infrastructure that supports advanced AI/ML research and product development at Deepgram. Your responsibilities include managing and optimizing Kubernetes on AWS and in on-premises environments, managing infrastructure as code with Terraform, and orchestrating high-demand GPU workloads with Slurm. You will collaborate closely with AI researchers to automate solutions to infrastructure challenges while keeping the platform observable and scalable. Key qualifications include 5+ years of relevant experience, expert-level Kubernetes knowledge, experience managing production infrastructure with Terraform, and strong automation skills.
Required Qualifications
- 5+ years of experience in Platform Engineering, DevOps, or Site Reliability Engineering (SRE).
- Proven, hands-on experience building and managing production infrastructure with Terraform.
- Expert-level knowledge of Kubernetes architecture and operations in a large-scale environment.
- Experience with high-performance computing (HPC) job schedulers, specifically Slurm, for managing GPU-intensive AI workloads.
- Experience managing bare metal infrastructure, including server provisioning (e.g., PXE boot, MAAS), configuration, and lifecycle management.
- Strong scripting and automation skills (e.g., Python, Go, Bash).
Desired Qualifications
- Experience with CI/CD systems (e.g., GitLab CI, Jenkins, ArgoCD) and building developer tooling.
- Familiarity with FinOps principles and cloud cost optimization strategies.
- Knowledge of Kubernetes networking (e.g., Calico, Cilium) and storage (e.g., Ceph, Rook) solutions.
- Experience in a multi-region or hybrid cloud environment.
Additional Requirements
- Deepgram is an equal opportunity employer and seeks diverse voices in the workforce.