Deepgram · 3 months ago

Site Reliability Engineer - AI & ML Infrastructure (Kubernetes, AWS & Terraform)

Remote · New York City, New York, United States (or elsewhere in the US)

Type: Full Time
Level: Mid Level
Education: Not Specified
Company size: Startup
Industry: AI/Technology

Job Summary

As a Site Reliability Engineer, you will architect, build, and maintain the hybrid infrastructure that supports advanced AI/ML research and product development at Deepgram. Your responsibilities include managing and optimizing Kubernetes on AWS and in on-premises environments, managing infrastructure as code with Terraform, and orchestrating high-demand GPU workloads with Slurm. You will collaborate closely with AI researchers to automate solutions to infrastructure challenges while keeping the platform observable and scalable. Key qualifications include 5+ years of relevant experience, expert-level Kubernetes knowledge, hands-on experience managing production infrastructure with Terraform, and strong scripting and automation skills.

Required Qualifications

  • 5+ years of experience in Platform Engineering, DevOps, or Site Reliability Engineering (SRE).
  • Proven, hands-on experience building and managing production infrastructure with Terraform.
  • Expert-level knowledge of Kubernetes architecture and operations in a large-scale environment.
  • Experience with high-performance computing (HPC) job schedulers, specifically Slurm, for managing GPU-intensive AI workloads.
  • Experience managing bare-metal infrastructure, including server provisioning (e.g., PXE boot, MAAS), configuration, and lifecycle management.
  • Strong scripting and automation skills (e.g., Python, Go, Bash).

Desired Qualifications

  • Experience with CI/CD systems (e.g., GitLab CI, Jenkins, ArgoCD) and building developer tooling.
  • Familiarity with FinOps principles and cloud cost optimization strategies.
  • Knowledge of Kubernetes networking (e.g., Calico, Cilium) and storage (e.g., Ceph, Rook) solutions.
  • Experience in a multi-region or hybrid cloud environment.

Additional Requirements

  • Deepgram is an equal opportunity employer and seeks diverse voices in the workforce.