Site Reliability Engineer - Big Data (7 to 11 years)
On-site · Karnataka, India
Job Summary
The Site Reliability Engineer - Big Data manages and maintains distributed big data ecosystems, ensuring the reliability, scalability, and security of large-scale production infrastructure. Key responsibilities include leading on-call rotations, incident response, and postmortems; designing automation for provisioning, scaling, upgrading, and patching clusters; troubleshooting complex production issues; designing scalable architectures; enforcing security standards; and driving standardization, proactive monitoring, capacity planning, and performance tuning. The role collaborates with development teams to integrate reliability, scalability, and performance practices into the software lifecycle; develops automation tools and scripts to reduce manual work; and stays current on industry trends while contributing to technology communities. Candidates need strong hands-on experience with Linux, the Hadoop stack (HDFS, HBase, Airflow, YARN, Ranger, Kafka, Pinot), scripting languages (Perl, Python, Golang), open-source configuration management tools (Puppet, Salt, Chef, Ansible), and DevOps tooling (SaltStack, Ansible, Docker, Git).
Required Qualifications
- 7+ years of experience in managing and maintaining distributed big data ecosystems
- Strong Linux expertise (IP, iptables, IPsec)
- Scripting/programming in Perl, Golang, or Python
- Hands-on experience with the Hadoop stack (HDFS, HBase, Airflow, YARN, Ranger, Kafka, Pinot)
- Experience with configuration management/deployment tools (Puppet, Salt, Chef, Ansible)
- Solid understanding of networking and open-source technologies
- DevOps tools: SaltStack, Ansible, Docker, Git
- SRE logging/monitoring tools: ELK, Grafana, Prometheus, OpenTelemetry