Gimlet · 1 day ago

Infrastructure / Cluster Engineer

On-site · San Francisco, California, United States

Type: Full Time
Level: Mid Level
Education: Not Specified
Company size: Enterprise

Job Summary

Design, deploy, and operate large-scale CPU, GPU, and accelerator clusters powering production AI inference.
Build automation for provisioning, configuration, upgrades, validation, and lifecycle management.
Design and scale provisioning systems for heterogeneous bare-metal infrastructure across multiple datacenters and hardware vendors.
Operate cluster scheduling, resource allocation, isolation, quotas, and utilization systems.
Debug complex production issues across Linux, networking, storage, drivers, firmware, and orchestration layers.
Build and operate high-performance networking infrastructure, including RDMA-enabled environments and accelerator interconnects.
Build observability for cluster health, capacity, performance, failures, and workload behavior.
Improve reliability, availability, and recovery across multi-node production systems.
Work with distributed systems and runtime teams to support low-latency, high-throughput inference workloads.
Evaluate and integrate new hardware platforms, accelerators, networking technologies, and datacenter designs.
Create runbooks, operational standards, and incident response practices as the fleet scales.
