Royalcaribbeangroup · 1 day ago

Senior Engineer, Site Reliability

Royalcaribbeangroup

On-site · Miramar, Florida, United States

Type: Full Time
Level: Senior Level
Education: Not Specified
Company size: Enterprise

Job Summary

This Senior Engineer, Site Reliability role owns and matures the enterprise observability platform across multi-brand hospitality and maritime environments. Responsibilities include end-to-end telemetry strategy (metrics, logs, traces, events), SLIs/SLOs, MTTR reduction, AI-assisted incident detection, and Kubernetes observability across AWS and Azure. The engineer coordinates with SRE and Platform Engineering, mentors junior engineers, leads post-incident reviews, and keeps observability tooling and pipelines cost-efficient across ship and shore operations. The position is on-site in Miramar, FL, and work-authorization sponsorship is not offered. Key technologies include Cisco AppDynamics, Splunk, ThousandEyes, PagerDuty, OpenTelemetry, Kubernetes (EKS/AKS), AWS/Azure monitoring, GitHub Actions, and ServiceNow for ITSM integration.

Required Qualifications

6–9+ years in Observability, SRE, or Platform Engineering in enterprise-scale environments.
Deep hands-on expertise with Cisco AppDynamics — APM configuration, business transaction mapping, code-level diagnostics, and baseline management.
Strong proficiency with Splunk — SPL query development, ITSI service health trees, KPI configuration, alert policy management, and log pipeline design.
Experience with Cisco ThousandEyes for network path monitoring, ISP/WAN intelligence, and BGP-level visibility.
Proficiency with PagerDuty AIOps — intelligent alert grouping, noise suppression, event orchestration, and on-call workflow design.
Strong command of OpenTelemetry — collector configuration, SDK instrumentation, semantic conventions, and multi-backend exporting.
Hands-on Kubernetes experience (EKS/AKS) — container observability, resource metrics, and pod-level distributed tracing.
Experience with AWS CloudWatch and/or Azure Monitor for cloud infrastructure observability.
Scripting and automation proficiency: Python, Bash, Terraform, and/or Ansible for observability tooling deployment and configuration.
Experience defining SLIs/SLOs, error budgets, and actionable alerting strategies tied to business service reliability.
ServiceNow ITSM integration experience — event management, incident auto-creation, and CMDB-enriched alerting.
Experience with CI/CD observability integration (GitHub Actions or equivalent).
Certifications: Cisco AppDynamics Certified Associate, Splunk Core Certified Power User, AWS Solutions Architect, Kubernetes (CKA/CKAD), or OpenTelemetry Certified Associate (OTCA/CNCF).