Director, Production Engineering
$127,000–$167,000 per year
Remote · United States
Job Summary
The Director, Production Engineering leads production reliability strategy and governance across the enterprise, driving observability (monitoring, logging, distributed tracing, telemetry), incident management, and production readiness. The role leads a function focused on service health, reliability metrics (SLOs/SLIs), operational risk, and performance improvement; partners with technology leaders to align reliability and operational excellence with business objectives; delivers executive-level reporting on operational health and risk; and steers AI-enabled operational practices (AIOps) to improve reliability and efficiency. The Director oversees governance frameworks for ownership, configuration management, feature management, and operational controls, and acts as incident commander to drive rapid, data-informed incident response and post-incident learning. The Director also sets standards and drives continuous improvement across production engineering, observability, incident response, and AI platform operations, collaborating closely with cross-functional teams and leadership to ensure production services are highly available, scalable, and resilient.
Required Qualifications
- Bachelor’s degree in Computer Science, Information Technology, Engineering, Management Information Systems, or related field required
- One year of relevant experience may be substituted for each year of required education
- Minimum of ten years of technology leadership experience in production operations, reliability engineering, or platform operations required
- Experience leading observability, monitoring, incident management, and operational governance programs required
- Experience with AIOps strategy, AI-enabled operational practices, or AI platform operations preferred
- Architecture-level mastery of AI and cloud-based operational systems required
- Expertise in reliability engineering, observability, monitoring, and service health management required
- Strong knowledge of incident management, root cause analysis, and operational risk practices required
- Experience with Service Level Objective (SLO), Service Level Indicator (SLI), and operational metrics frameworks required
- Proven ability to communicate operational performance and risk to executive leadership required
- Strong leadership, communication, and cross-functional collaboration skills required