Infrastructure / Cluster Engineer
On-site · San Francisco, California, United States
Job Summary
Design, deploy, and operate large-scale CPU, GPU, and accelerator clusters powering production AI inference.
Build automation for provisioning, configuration, upgrades, validation, and lifecycle management.
Design and scale provisioning systems for heterogeneous bare-metal infrastructure across multiple datacenters and hardware vendors.
Operate cluster scheduling, resource allocation, isolation, quotas, and utilization systems.
Debug complex production issues across Linux, networking, storage, drivers, firmware, and orchestration layers.
Build and operate high-performance networking infrastructure, including RDMA-enabled environments and accelerator interconnects.
Build observability for cluster health, capacity, performance, failures, and workload behavior.
Improve reliability, availability, and recovery across multi-node production systems.
Work with distributed systems and runtime teams to support low-latency, high-throughput inference workloads.
Evaluate and integrate new hardware platforms, accelerators, networking technologies, and datacenter designs.
Create runbooks, operational standards, and incident response practices as the fleet scales.