Senior SRE & Technology Support Lead – Applied AI/ML
On-site · Buenos Aires, Buenos Aires F.D., Argentina
Job Summary
Lead a Site Reliability Engineering (SRE) function for AI/ML infrastructure, driving reliability, scalability, security, and operational efficiency. Design, implement, and optimize SRE practices; develop automated monitoring, alerting, and incident response; manage day-to-day support issues and conduct root cause analysis to reduce repeat errors. Act as technical lead for medium to large products, mentor engineers, and advocate for a culture of reliability. Collaborate with stakeholders to define service level indicators and objectives (SLIs/SLOs) and error budgets, and apply data-driven analytics to improve service levels. Demonstrate deep technical expertise across domains, guide incident response, and share knowledge via internal forums. The role requires proficiency in AWS, Terraform/CloudFormation, Prometheus/Grafana/CloudWatch, CI/CD tools, container orchestration, and programming in a language such as Python or Java.
Required Qualifications
- 5+ years in site reliability or infrastructure engineering roles
- Deep expertise in AWS cloud services, infrastructure automation (Terraform, CloudFormation), and monitoring tools (Prometheus, Grafana, CloudWatch)
- Strong problem-solving, communication, and collaboration skills
- Experience with CI/CD pipelines, operational stability, and risk management
- Bachelor’s or Master’s degree in Computer Science, Engineering, or related field (or equivalent experience)
- Proficiency in at least one programming language (e.g., Python, Java)
- Observability tools (Grafana, Dynatrace, Prometheus, Datadog, Splunk)
- Container orchestration (ECS, Kubernetes, Docker)
- Network troubleshooting skills
- Advanced English skills
- Submit resume in English