Staff Site Reliability Engineer – Automation and Platform
On-site · Toronto, Ontario, Canada
Job Summary
This Staff Site Reliability Engineer role leads the automation and platform efforts for Cerebras’ high-performance AI inference service. The role focuses on eliminating toil by delivering self-service delivery pipelines and GitOps-driven CD for model releases, capacity provisioning, and cluster upgrades across multi-datacenter and on-prem environments. You will architect internal tooling and platforms that let product teams, external customers, and cluster operators observe and trigger critical workflows with minimal handoffs; define reliability practices built on SLOs/SLIs, chaos testing, and capacity forecasting; and mentor mid-level SREs as the organization shifts reliability from an ops function to a shared engineering discipline. You will collaborate with core teams, product managers, and leadership to drive measurable improvements in deployment velocity, toil reduction, and reliability. The role emphasizes leading complex initiatives end to end, mentoring others, and operating at scale without 24/7 on-call rotations.
Required Qualifications
- 8+ years in SRE, infrastructure engineering, or platform engineering
- Experience with CI/CD or GitOps
- Experience with observability tools (e.g., Loki, Tempo, Mimir, Prometheus)
- Ability to lead complex projects end to end
- Strong communication and cross-functional collaboration