Senior Incident Manager
$125,000–$195,000 per year
Remote · United States or San Jose, California, United States
Job Summary
The Senior Incident Manager leads the end-to-end lifecycle of operational incidents impacting AI infrastructure and data center services. The role acts as central command during major incidents, coordinating cross-team response, triage, and post-incident analysis. It drives incident management best practices across data center operations, infrastructure engineering and operations, networking, platform reliability, and security operations. The role also participates in the on-call rotation, delivers executive-level incident summaries and dashboards, and improves operational resilience, incident tooling, runbooks, and reliability frameworks.
Required Qualifications
- 8+ years of experience in incident management, site reliability engineering, or infrastructure operations
- Experience managing incidents in large-scale distributed infrastructure environments
- Strong understanding of data center operations, GPU compute clusters, networking, and storage infrastructure
- Experience with incident management frameworks (ITIL, SRE, or equivalent)
- Excellent communication and stakeholder management skills
- Experience with incident tracking and monitoring tools such as PagerDuty, ServiceNow, Jira, Datadog, Prometheus, and Grafana