Cerebras Systems · 8 months ago

Staff Site Reliability Engineer – Automation and Platform

Cerebras Systems

On-site · Toronto, Ontario, Canada

Type: Full Time
Level: Senior Level
Education: Not Specified
Company size: Startup

Job Summary

The Staff Site Reliability Engineer will lead the automation and platform efforts for Cerebras' high-performance AI inference service. The role focuses on eliminating toil by building self-service delivery pipelines and GitOps-driven CD for model releases, capacity provisioning, and cluster upgrades across multi-datacenter and on-prem environments. You will architect internal tooling and platforms that let product teams, external customers, and cluster operators observe and trigger critical workflows with minimal handoffs. You will also define reliability practices (SLOs/SLIs, chaos testing, and capacity forecasting) and mentor mid-level SREs as the organization shifts reliability from an ops function to a shared engineering discipline. You will collaborate with core teams, product managers, and leadership to drive measurable improvements in deployment velocity, toil reduction, and reliability. The role emphasizes leading complex initiatives end to end, mentoring others, and operating at scale without 24/7 on-call rotations.

Required Qualifications

8+ years in SRE, infrastructure engineering, or platform engineering
Experience with CI/CD or GitOps
Experience with observability tools (e.g., Loki, Tempo, Mimir, Prometheus)
Ability to lead complex projects end to end
Strong communication and cross-functional collaboration

Additional Requirements

None specified
