Gimlet · 1 day ago

Infrastructure / Cluster Engineer

Gimlet

On-site · San Francisco, California, United States

Type

Full Time

Level

Mid Level

Education

Not Specified

Company size

Enterprise

Job Summary

Design, deploy, and operate large-scale CPU, GPU, and accelerator clusters powering production AI inference.
Build automation for provisioning, configuration, upgrades, validation, and lifecycle management.
Design and scale provisioning systems for heterogeneous bare-metal infrastructure across multiple datacenters and hardware vendors.
Operate cluster scheduling, resource allocation, isolation, quotas, and utilization systems.
Debug complex production issues across Linux, networking, storage, drivers, firmware, and orchestration layers.
Build and operate high-performance networking infrastructure, including RDMA-enabled environments and accelerator interconnects.
Build observability for cluster health, capacity, performance, failures, and workload behavior.
Improve reliability, availability, and recovery across multi-node production systems.
Work with distributed systems and runtime teams to support low-latency, high-throughput inference workloads.
Evaluate and integrate new hardware platforms, accelerators, networking technologies, and datacenter designs.
Create runbooks, operational standards, and incident response practices as the fleet scales.
