CoreWeave · 8 months ago

Staff Software Engineer, Compute Architecture

CoreWeave

$188,000–$275,000 per year

On-site · New York City, New York, United States or Bellevue, Washington, United States

Type: Full Time
Level: Senior Level
Education: Doctorate Or Professional Degree
Company size: Unknown
Industry: TECH

Job Summary

The Staff Software Engineer in Compute Architecture designs, builds, and operates Go-based services that manage the lifecycle of large-scale GPU data center infrastructure. Responsibilities include contributing to automation for data center bring-up, hardware discovery, health monitoring, remediation, and production operations; developing reliable APIs and workflows for hardware management, firmware state, server health, and rack-level infrastructure; and improving observability, incident response, and reliability. The engineer collaborates with hardware-adjacent, infrastructure, operations, and software teams, provides technical leadership through design and code reviews, architectural guidance, and mentorship, and makes pragmatic architecture decisions that balance reliability, simplicity, and scalability. The role emphasizes reliability, scalability, and driving platform improvements across fleets of GPU servers in a large-scale data center environment.

Required Qualifications

B.S., M.S., or PhD in Computer Science or related field, or equivalent experience
8+ years of software engineering experience with a strong focus on infrastructure, cloud engineering, and distributed databases—particularly within large-scale datacenter and cloud environments
Expertise in Go and proven experience building REST/gRPC APIs for mission-critical platforms
Strong background in architecting and scaling cloud-native Kubernetes infrastructure and distributed services
Proven success in mentoring engineers, leading technical projects, and influencing engineering strategy across teams
Experience contributing to and collaborating with open source communities
Skilled in applying a data-driven approach to reliability, optimization, and continuous improvement
Excellent communicator able to work effectively with both technical and non-technical stakeholders
Hands-on experience with observability stacks (Prometheus, Grafana, PromQL), CI/CD pipelines, and operating large fleets of GPU servers
Track record of leading incident response, postmortems, and driving robust service reliability

Nice to Have

Working knowledge of Kafka, ClickHouse, and CockroachDB (CRDB); DMTF, Redfish APIs, and GPU servers
