Principal Software Engineer
On-site · Chennai, Tamil Nadu, India
Job Summary
The Principal Software Engineer (Site Reliability & Application Support) drives the reliability strategy for large-scale, cloud-native applications spanning an Angular front end, Node.js services, a Java back end, and Python tooling. The engineer owns end-to-end reliability: monitoring and triaging production incidents, performing root cause analyses (RCAs), defining SLIs and SLOs, and leading post-mortems. Responsibilities also include designing observability across logs, metrics, traces, and synthetic monitoring; building dashboards and alerting; driving automation to reduce toil; coordinating releases with safe deployment practices; and collaborating with development, platform, and architecture teams to embed reliability as a core engineering concern. The role requires extensive hands-on experience in SRE, incident triage, observability tooling (Prometheus, Grafana, OpenTelemetry, Datadog, Dynatrace, Splunk, ELK), and cloud/container ecosystems (Azure/AWS/GCP, Docker, Kubernetes). Strong communication, leadership, and problem-solving skills are essential.
Required Qualifications
- Education: Bachelor's or Master's degree in Computer Science, Information Technology, or a related field
- Experience: 10–15+ years of hands-on software engineering and/or SRE experience
- Technical: strong experience across Angular, Node.js, Java, Python; SRE & reliability engineering fundamentals; incident management; observability tooling; automation/scripting; cloud platforms (Azure/AWS/GCP); containers and orchestration (Docker, Kubernetes)
- Leadership: demonstrated leadership at Staff/Principal/Architect level and ability to influence reliability strategy across teams
Desired Qualifications
- Proven experience designing and operating enterprise-grade, large-scale production systems
- Demonstrated impact at Staff/Principal/Architect level in SRE, platform engineering, or application reliability
- Strong background in influencing reliability and observability strategy across multiple teams or platforms
- Demonstrated experience leading incident triage and driving resolution in high-pressure, high-stakes environments
- Leadership & Soft Skills: exceptional analytical, diagnostic, and structured problem-solving skills; strong written and verbal communication; ability to lead under pressure; high ownership and bias for action; collaborative mindset; continuous improvement orientation
- Nice to Have: Kafka, event-driven architectures, and observability for streaming systems; security monitoring and vulnerability management in production; experience with Spark, BigQuery, or Databricks; chaos engineering principles and tooling; certifications such as AWS/Azure/GCP Associate or Professional, or CKA (Certified Kubernetes Administrator) or equivalent