Site Reliability Engineer
Hybrid · Northville, Michigan, United States
Job Summary
The Site Reliability Engineer is responsible for the reliability, performance, observability, and operational excellence of Liveline’s production services, from factory-floor edge systems to AWS cloud components. You will help build and run resilient infrastructure, implement monitoring and alerting, and collaborate with controls engineers, data scientists, and software teams to deploy changes safely, define SLIs/SLOs, and improve availability and latency for real-time process control.
Responsibilities include:
- Operating production systems with high availability and security
- Standing up dashboards and alerts with Prometheus/Grafana
- Implementing logs and traces with OpenTelemetry
- Managing infrastructure as code with Terraform
- Writing Bash/Python scripts to automate tasks
- Participating in on-call incident response
- Contributing to runbooks, architecture diagrams, and disaster recovery procedures
Education and qualification requirements include a Bachelor’s degree and 5+ years of experience, familiarity with containers and orchestration, experience with on-call and postmortems, strong communication, and willingness to travel to customer sites as needed.
Required Qualifications
- Bachelor’s degree in IT, Computer Science, or Computer Engineering (or equivalent experience)
- 5+ years of experience in a corporate IT or startup setting
- Familiarity with containers (Docker) and orchestration (Kubernetes or ECS)
- Experience running production workloads, participating in on-call, and writing postmortems
- Willingness and ability to travel to customer sites and plants, as necessary
- Strong communication skills with the ability to explain tradeoffs to non-SRE stakeholders