Senior Software Engineer, DGX Cloud Production Engineering
$184,000–$356,500 per year
Remote · United States or Santa Clara, California, United States
Job Summary
Senior Software Engineer to build automation, tooling, and operational systems for large-scale GPU clusters in DGX Cloud, with a focus on Kubernetes-based infrastructure, cluster provisioning, and lifecycle operations. Responsibilities include developing tools for provisioning, validation, upgrades, monitoring, repair, and Day 2 operability; reducing manual production touches through APIs, GitOps, automation, and agent-assisted workflows; participating in on-call, incident response, and durable follow-up work; and partnering with platform, storage, networking, and security teams to make infrastructure production-ready. The role requires 8+ years of experience building or operating production infrastructure; strong programming skills in Python, Go, or similar; experience with Linux, Kubernetes, containers, cloud infrastructure, or infrastructure automation; the ability to troubleshoot distributed systems in production; and a BS/MS in Computer Science or equivalent experience. Preferred experience includes GPU infrastructure, Kubernetes operators, GitOps, Terraform, ArgoCD, SLOs, on-call, observability, and multi-cloud infrastructure.
Required Qualifications
- 8+ years of experience building or operating production infrastructure
- Strong programming skills in Python, Go, or similar
- Experience with Linux, Kubernetes, containers, cloud infrastructure, or infrastructure automation
- Ability to troubleshoot distributed systems in production
- Clear communication and ability to work across teams
- BS/MS in Computer Science or equivalent experience
This role has closed.