NVIDIA · 2 weeks ago

Senior Site Reliability Engineering - Storage

On-site · Bengaluru, Karnataka, India

Type: Full Time
Level: Senior Level
Education: Bachelor's Degree
Company size: Enterprise

Job Summary

As a Senior Site Reliability Engineer - Storage, you will be responsible for the reliability, performance, and scalability of NVIDIA's global NAS, SAN, and Object Storage platforms. In this role, you will:

Lead the design, deployment, and operations of storage systems: capture requirements from partner teams, architect storage solutions, and drive end-to-end implementation for new and existing services
Develop, maintain, and improve automation for provisioning, configuration, monitoring, incident response, and lifecycle management of storage infrastructure
Participate in on-call and incident response, lead troubleshooting of complex storage and performance issues, and drive root cause analysis and preventive actions
Define and track SLOs/SLIs and error budgets, using observability and analytics to continuously improve reliability and efficiency
Build runbooks, standard operating procedures, and comprehensive documentation for storage services and automation
Analyze capacity and usage trends, perform forecasting, and recommend scaling or optimization strategies to support business growth
Collaborate closely with SRE, infrastructure, networking, and application teams in a follow-the-sun model to deliver consistent, high-quality service
Mentor junior engineers, share best practices, and help drive adoption of SRE principles across the team

Required Qualifications

12+ years of experience in Site Reliability, DevOps, or Infrastructure Engineering with significant focus on storage systems
Strong hands-on experience with design, deployment, and operations of NAS, SAN, and/or Object Storage platforms
Solid understanding of SRE concepts (SLOs/SLIs, error budgets, incident management, observability, postmortems)
Proficiency with Infrastructure as Code and configuration management tools (Terraform, Ansible, Puppet, SaltStack) and source control systems
Experience building and operating highly available, scalable infrastructure, including automation for provisioning, monitoring, and remediation
Experience with container and virtualization platforms (Docker, Kubernetes, hypervisors) and modern CI/CD and version control tools
Strong scripting or programming skills (Python, Go, Shell) to build tools, automate workflows, and integrate systems
Bachelor’s degree in Computer Science, Computer Engineering, or a related technical field (or equivalent practical experience)
