Senior Engineer, Site Reliability
On-site · Miramar, Florida, United States
Job Summary
The Senior Engineer, Site Reliability owns and matures the enterprise observability platform across multi-brand hospitality and maritime environments. The role is responsible for end-to-end telemetry strategy (metrics, logs, traces, events), SLIs/SLOs, MTTR reduction, AI-assisted incident detection, and Kubernetes observability across AWS and Azure. It coordinates with SRE and Platform Engineering, mentors junior engineers, leads post-incident reviews, and keeps observability tooling and pipelines cost-efficient across ship and shore operations. The position is on-site in Miramar, FL, and does not offer work-authorization sponsorship. Key technologies include Cisco AppDynamics, Splunk, ThousandEyes, PagerDuty, OpenTelemetry, Kubernetes (EKS/AKS), AWS/Azure monitoring, GitHub Actions, and ITSM integration with ServiceNow.
Required Qualifications
- 6–9+ years of experience in Observability, SRE, or Platform Engineering in enterprise-scale environments.
- Deep hands-on expertise with Cisco AppDynamics — APM configuration, business transaction mapping, code-level diagnostics, and baseline management.
- Strong proficiency with Splunk — SPL query development, ITSI service health trees, KPI configuration, alert policy management, and log pipeline design.
- Experience with Cisco ThousandEyes for network path monitoring, ISP/WAN intelligence, and BGP-level visibility.
- Proficiency with PagerDuty AIOps — intelligent alert grouping, noise suppression, event orchestration, and on-call workflow design.
- Strong command of OpenTelemetry — collector configuration, SDK instrumentation, semantic conventions, and multi-backend exporting.
- Hands-on Kubernetes experience (EKS/AKS) — container observability, resource metrics, and pod-level distributed tracing.
- Experience with AWS CloudWatch and/or Azure Monitor for cloud infrastructure observability.
- Scripting and automation proficiency: Python, Bash, Terraform, and/or Ansible for observability tooling deployment and configuration.
- Experience defining SLIs/SLOs, error budgets, and actionable alerting strategies tied to business service reliability.
- ServiceNow ITSM integration experience — event management, incident auto-creation, and CMDB-enriched alerting.
- Experience with CI/CD observability integration (GitHub Actions or equivalent).
- Certifications: Cisco AppDynamics Certified Associate, Splunk Core Certified Power User, AWS Certified Solutions Architect, Kubernetes (CKA/CKAD), or OpenTelemetry Certified Associate (OTCA, CNCF).