Senior Site Reliability Engineer
Remote · Medellín, Antioquia, Colombia
Job Summary
Senior Site Reliability Engineer responsible for owning and evolving the reliability, security, observability, and operational maturity of the cloud platform with an AI-native mindset. Drive AI-powered automation across infrastructure, incident response, compliance, and operational excellence. Leverage extensive AWS expertise (VPC, ECS, IAM, RDS, S3, CloudFront, Route53, ALB, API Gateway, Lambda), Terraform IaC, and observability tooling (Grafana, log analysis, distributed tracing) to improve production uptime. Lead incident response and postmortems, optimize CI/CD pipelines, and ensure security and governance (SOC-2, ISO 27001, HIPAA, PCI). Requires strong Linux, Docker, scripting (Bash, Python/Go/TypeScript), and networking fundamentals, plus collaboration with global teams across LATAM, US, and beyond.
Required Qualifications
- AI-Native SRE Operations (Hard Requirement)
- Expert-level proficiency using AI to automate SRE and infrastructure operations
- Daily use of AI assistants and agentic workflows in engineering practice
- Hands-on use of AI for Terraform authoring and review, incident triage, log analysis, runbook generation, operational automation, postmortem drafting, Lambda automation, and pipeline generation
- Strong understanding of where AI is effective and where human validation is critical
- Ability to articulate AI workflows, tooling choices, safeguards, and production outcomes
- Cloud Infrastructure & AWS (Hard Requirement)
- 10+ years of professional experience operating production infrastructure for SaaS platforms
- 5+ years of senior-level AWS operational ownership
- Deep expertise across AWS services (VPC, ECS, IAM, RDS, S3, CloudFront, Route53, ACM, CloudWatch, Secrets Manager, SSM, ALB, API Gateway, Lambda)
- Familiarity with AWS security and governance tooling (WAF, GuardDuty, CloudTrail, Inspector, Security Hub, AWS Config, AWS Backup)
- Terraform & Infrastructure as Code (Advanced Terraform, multi-account/multi-workspace)
- Experience resolving production infrastructure drift safely
- Incident Response & Operational Leadership
- Observability & Monitoring (Grafana, distributed tracing, log aggregation, alert tuning)
- Experience owning CI/CD pipelines end-to-end
- Linux, Containers & Networking (Bash, Python/Go/TypeScript, Docker, networking fundamentals)
- Security & Compliance (IAM least privilege, encryption, network isolation, vulnerability management, OWASP Top 10 for infra, SAML/OIDC/SCIM provisioning)
- Experience implementing and maintaining compliance controls (SOC-2, ISO 27001, HIPAA, PCI)
- Experience engaging with auditors and evidencing controls
- Nice to Have: Spring Boot/JVM production experience, runtime security/EDR tooling (Falco), SCIM/IdP tooling
- AWS certifications (Architect/DevOps/Security)
- Kotlin/Java backend understanding from SRE perspective
- Soft skills: communication, autonomy, rapid learning, reliability