Monitoring Engineer
On-site · San Francisco, California, United States
Job Summary
Platform Monitoring Engineer / Incident Manager responsible for on-call monitoring, coordinating incident response, communicating with merchants during incidents, and leading initiatives to improve monitoring and reliability. Duties include incident management; problem management; developing and improving logging and alerting; collaborating with Engineering, Operations, and Product to implement scalable monitoring solutions; and driving automation to reduce merchant impact and increase platform reliability.
Required Qualifications
- 5+ years of experience in incident management and platform monitoring operations
- experience with problem management (root cause analysis, trend identification)
- strong communication skills across technical and non-technical audiences
- willingness to participate in an on-call rotation
- experience with monitoring and logging tools (Prometheus, Grafana, ELK Stack) and observability platforms (Datadog, Dynatrace, Splunk)
- ability to translate complex technical concepts for diverse audiences
- ability to work collaboratively with cross-functional teams (Operations, Product, Engineering)
- focus on building scalable monitoring and automation
- ability to manage multiple responsibilities in a dynamic environment